Aggregate #

The following measures for Aggregate objects are implemented:

Global similarity measures are defined by applying an aggregation function \(\Phi\) to the local similarity values. Such aggregation functions are defined by determining

  • a basic aggregation function and
  • a weight model that determines weights \(\overline{\omega}=(\omega_1,\ldots,\omega_n)\) such that \(0 \leq \omega_i \leq 1\) and \(\sum_{i=1...n} \omega_i = 1\) .

To guarantee that \(\sum_{i=1...n} \omega_i = 1\), all weights are normalized automatically at runtime.

Global parameters of all aggregate similarity measures #

| Parameter | Type/Range | Default Value | Description |
| --- | --- | --- | --- |
| defaultWeight | Weight (Double, [0,1]) | 1.0 | The default weight for attributes; it is used if no specific weight is defined. |
| ignoreNullAttributesInQuery | Flag (Boolean) | true | Specifies whether missing/unspecified attributes (null attributes) in aggregate objects contained in a query should be ignored. If set to false, these null attributes are not ignored and must also be null in the corresponding case objects to be evaluated with a similarity of 1.0. This is particularly useful for finding semantically equal objects. |

Average #

The weighted average is the most commonly used aggregation function. Each attribute contributes to the similarity (and thereby to the utility) in proportion to its weight.

Similarity: \(\Phi(s_1,\ldots,s_n)=\sum_{i=1}^n \omega_i \cdot s_i\)

For example, an Aggregate Average measure can look like:

<AggregateAverage name="AggregateAverageEmployee" class="Employee" defaultWeight="0.5">
   <AggWeight att="room" weight="0.3"/>
</AggregateAverage>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). No weight is given for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3.

To create this measure during runtime, the following code would be used:

SMAggregateAverageImpl aggregateAverageEmployee = (SMAggregateAverageImpl) simVal.getSimilarityModel().createSimilarityMeasure(SMAggregateAverage.NAME, ModelFactory.getDefaultModel().getAggregateSystemClass());
aggregateAverageEmployee.setDataClass(ModelFactory.getDefaultModel().getClass("Employee"));
aggregateAverageEmployee.setDefaultWeight(0.5);
aggregateAverageEmployee.setWeight("room", 0.3);
simVal.getSimilarityModel().addSimilarityMeasure(aggregateAverageEmployee, "AggregateAverageEmployee");

For example, there are two aggregate objects of the class Employee. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name ( \(\frac{0.5}{0.5+0.3}\) ) and 0.375 for room ( \(\frac{0.3}{0.5+0.3}\) ).

Using simple measures for the attributes, the similarity between the names is 1.0 and between the rooms 0.0. For the total similarity, the weighted average of both similarities is used. So, the similarity is \(0.625 * 1.0 + 0.375 * 0.0 = 0.625\) .
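The arithmetic of this example can be reproduced with a few lines of plain Java. This is only a minimal sketch of the formula, not the library's implementation; the class `WeightedAverageSketch` and its method `aggregate` are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class WeightedAverageSketch {

  /** Weighted average of local similarities; raw weights are normalized by their sum. */
  static double aggregate(Map<String, Double> rawWeights, Map<String, Double> localSims) {
    double weightSum = 0.0;
    double weightedSimSum = 0.0;
    for (Map.Entry<String, Double> e : localSims.entrySet()) {
      double w = rawWeights.get(e.getKey());
      weightSum += w;
      weightedSimSum += w * e.getValue();
    }
    // Dividing by the weight sum is equivalent to normalizing each weight first.
    return weightedSimSum / weightSum;
  }

  public static void main(String[] args) {
    Map<String, Double> weights = new LinkedHashMap<>();
    weights.put("name", 0.5); // default weight
    weights.put("room", 0.3); // explicitly configured weight

    Map<String, Double> sims = new LinkedHashMap<>();
    sims.put("name", 1.0); // identical names
    sims.put("room", 0.0); // different rooms

    System.out.println(aggregate(weights, sims)); // approximately 0.625
  }
}
```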

Please note the following when specifying weights: Attributes that have the same weight, e.g. 1, contribute equally to the similarity computation. To increase the importance of one attribute, choose a value greater than 1; to decrease its importance, choose a value smaller than 1. Weights are automatically normalized (i.e., their sum equals 1.0) during similarity computation. However, already normalized weights can also be set manually. Please note that SMAggregateAverage aggregates only attributes that are contained in a query aggregate object. Consequently, the original ratio of weights might be changed during normalization if attributes are missing.
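The effect of attributes missing from the query can be sketched as follows. The example assumes that normalization simply divides each weight by the sum of the weights of the attributes present in the query; the class `QueryAttributeWeightsSketch`, the method `effectiveWeights`, and the third attribute `department` are hypothetical and only serve as illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class QueryAttributeWeightsSketch {

  /** Normalizes the raw weights of only those attributes that are present in the query. */
  static Map<String, Double> effectiveWeights(Map<String, Double> rawWeights,
                                              List<String> queryAttributes) {
    double sum = 0.0;
    for (String att : queryAttributes) {
      sum += rawWeights.get(att);
    }
    Map<String, Double> normalized = new LinkedHashMap<>();
    for (String att : queryAttributes) {
      normalized.put(att, rawWeights.get(att) / sum);
    }
    return normalized;
  }

  public static void main(String[] args) {
    Map<String, Double> raw = new LinkedHashMap<>();
    raw.put("name", 2.0);       // more important than the other attributes
    raw.put("room", 1.0);
    raw.put("department", 1.0); // hypothetical third attribute

    // Query sets all three attributes: name contributes 2/4 of the result.
    System.out.println(effectiveWeights(raw, Arrays.asList("name", "room", "department")));
    // Query omits department: name now contributes 2/3 of the result.
    System.out.println(effectiveWeights(raw, Arrays.asList("name", "room")));
  }
}
```

In this sketch the share of name grows from one half to two thirds once department is absent from the query, even though no configured weight was changed.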

Maximum #

When using the maximum aggregation, the overall similarity is determined by the maximum local similarity. If one attribute indicates a high utility, the utility for the whole case is high.

Similarity: \(\Phi(s_1,\ldots,s_n)=\max_{i=1}^n \omega_i \cdot s_i\)

For example, an Aggregate Maximum measure can look like:

<AggregateMaximum name="AggregateMaximumEmployees" class="Employee" defaultWeight="0.5">
   <AggWeight att="room" weight="0.3"/>
</AggregateMaximum>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). No weight is given for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3.

To create this measure during runtime, the following code would be used:

SMAggregateMaximumImpl aggregateMaximumEmployees = (SMAggregateMaximumImpl) simVal.getSimilarityModel().createSimilarityMeasure(SMAggregateMaximum.NAME, ModelFactory.getDefaultModel().getAggregateSystemClass());
aggregateMaximumEmployees.setDataClass(ModelFactory.getDefaultModel().getClass("Employee"));
aggregateMaximumEmployees.setDefaultWeight(0.5);
aggregateMaximumEmployees.setWeight("room", 0.3);
simVal.getSimilarityModel().addSimilarityMeasure(aggregateMaximumEmployees, "AggregateMaximumEmployees");

For example, consider the same two aggregate objects of the class Employee that were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. For this measure, the weights are used only to rank the local similarity values: each similarity value is multiplied by its weight, and the products are compared. The result of the similarity computation is the selected similarity value without its weight. Using simple measures, e.g. StringEquals, for the attributes, the similarity between the names is 1.0 and between the rooms 0.0. As 1.0 is the highest similarity, this value would be returned.
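The ranking behavior described above (weights decide which attribute wins, but the unweighted value is returned) can be sketched in plain Java. This follows the textual description only and is not the library's code; `MaximumAggregationSketch` and `aggregate` are hypothetical names.

```java
final class MaximumAggregationSketch {

  /** Returns the unweighted similarity of the attribute whose weighted similarity is largest. */
  static double aggregate(double[] weights, double[] sims) {
    if (weights.length != sims.length || sims.length == 0) {
      throw new IllegalArgumentException("weights and sims must be non-empty and of equal length");
    }
    int best = 0;
    for (int i = 1; i < sims.length; i++) {
      if (weights[i] * sims[i] > weights[best] * sims[best]) {
        best = i;
      }
    }
    return sims[best];
  }

  public static void main(String[] args) {
    // Employee example: name (0.625 * 1.0) outranks room (0.375 * 0.0).
    System.out.println(aggregate(new double[] {0.625, 0.375}, new double[] {1.0, 0.0})); // prints 1.0

    // The weights can change the winner: 0.2 * 0.9 is smaller than 0.8 * 0.8,
    // so 0.8 is returned although 0.9 is the larger unweighted similarity.
    System.out.println(aggregate(new double[] {0.2, 0.8}, new double[] {0.9, 0.8})); // prints 0.8
  }
}
```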

K-Maximum #

The k-th highest local similarity value determines the global similarity. Hence, if k out of n attributes indicate a high utility, the overall utility for the whole case is high.

Similarity: \(\begin{array}{l} \Phi(s_1,\ldots,s_n)=\omega_{i_k} \cdot s_{i_k} \\ \quad \text{with}\, \omega_{i_r}\cdot s_{i_r} \geq \omega_{i_{r+1}}\cdot s_{i_{r+1}} \end{array}\)

The following parameters can be set for this similarity measure.

| Parameter | Type/Range | Default Value | Description |
| --- | --- | --- | --- |
| k | Attribute number (int) | 1 | Defines which local similarity is returned: the k-th highest one. Only values greater than 0 are allowed. If k is set to a value higher than the number of attributes, an exception is thrown. |

For example, an Aggregate K Maximum measure can look like:

<AggregateKMaximum name="AggregateKMaximumEmployees" class="Employee" k="2" default="false" defaultWeight="0.5">
   <AggWeight att="room" weight="0.3"/>
</AggregateKMaximum>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). No weight is given for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3. The value for k is set to 2, so the second highest similarity will be returned. Because Employee has only two attributes, this is the local similarity of the lower-ranked attribute.

To create this measure during runtime, the following code would be used:

SMAggregateKMaximumImpl aggregateKMaximumEmployee = (SMAggregateKMaximumImpl) simVal.getSimilarityModel().createSimilarityMeasure(SMAggregateKMaximum.NAME, ModelFactory.getDefaultModel().getAggregateSystemClass());
aggregateKMaximumEmployee.setDataClass(ModelFactory.getDefaultModel().getClass("Employee"));
aggregateKMaximumEmployee.setDefaultWeight(0.5);
aggregateKMaximumEmployee.setWeight("room", 0.3);
aggregateKMaximumEmployee.setK(2);
simVal.getSimilarityModel().addSimilarityMeasure(aggregateKMaximumEmployee, "AggregateKMaximumEmployees");

For example, consider the same two aggregate objects of the class Employee that were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. For this measure, the weights are used only to rank the local similarity values: each similarity value is multiplied by its weight, and the products are sorted. The result of the similarity computation is the selected similarity value without its weight. Using simple measures, e.g. StringEquals, for the attributes, the similarity between the names is 1.0 and between the rooms 0.0. As 0.0 is the second highest similarity, the value 0.0 would be returned.

As another example, assume two aggregate objects of any class containing five attributes. The following similarities were calculated: 1.0, 0.8, 0.6, 0.4 and 0.2. Using the k-value of 2, the second highest value of 0.8 is returned. Using the k-value of 4, the fourth highest value of 0.4 is returned. For a k-value of 1, the highest value of 1.0 is returned, analogous to AggregateMaximum, and for k=5 the smallest value 0.2, corresponding to AggregateMinimum.
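The five-attribute example can be traced with the following sketch. It assumes equal weights for all five attributes and uses `IllegalArgumentException` for an invalid k; the documentation only states that an exception is thrown, so the exception type, the class `KMaximumSketch`, and the method `aggregate` are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

final class KMaximumSketch {

  /** Returns the unweighted similarity that ranks k-th when sorted by weight * similarity, highest first. */
  static double aggregate(double[] weights, double[] sims, int k) {
    if (weights.length != sims.length) {
      throw new IllegalArgumentException("weights and sims must have equal length");
    }
    if (k < 1 || k > sims.length) {
      throw new IllegalArgumentException("k must be between 1 and the number of attributes");
    }
    Integer[] order = new Integer[sims.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    // Sort attribute indices by weighted similarity in descending order.
    Arrays.sort(order, Comparator.comparingDouble((Integer i) -> weights[i] * sims[i]).reversed());
    return sims[order[k - 1]];
  }

  public static void main(String[] args) {
    double[] weights = {0.2, 0.2, 0.2, 0.2, 0.2}; // equal weights (assumption)
    double[] sims = {1.0, 0.8, 0.6, 0.4, 0.2};

    System.out.println(aggregate(weights, sims, 1)); // prints 1.0, same as the maximum
    System.out.println(aggregate(weights, sims, 2)); // prints 0.8
    System.out.println(aggregate(weights, sims, 4)); // prints 0.4
    System.out.println(aggregate(weights, sims, 5)); // prints 0.2, same as the minimum
  }
}
```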

Minimum #

When using the minimum aggregation, the overall similarity is determined by the minimum local similarity. If one attribute indicates a low utility, the utility for the whole case is low.

Similarity: \(\Phi(s_1,\ldots,s_n)=\min_{i=1}^n \omega_i \cdot s_i\)

For example, an Aggregate Minimum measure can look like:

<AggregateMinimum name="AggregateMinimumEmployees" class="Employee" defaultWeight="0.5">
   <AggWeight att="room" weight="0.3"/>
</AggregateMinimum>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). No weight is given for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3.

To create this measure during runtime, the following code would be used:

SMAggregateMinimumImpl aggregateMinimumEmployees = (SMAggregateMinimumImpl) simVal.getSimilarityModel().createSimilarityMeasure(SMAggregateMinimum.NAME, ModelFactory.getDefaultModel().getAggregateSystemClass());
aggregateMinimumEmployees.setDataClass(ModelFactory.getDefaultModel().getClass("Employee"));
aggregateMinimumEmployees.setDefaultWeight(0.5);
aggregateMinimumEmployees.setWeight("room", 0.3);
simVal.getSimilarityModel().addSimilarityMeasure(aggregateMinimumEmployees, "AggregateMinimumEmployees");

For example, consider the same two aggregate objects of the class Employee that were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. For this measure, the weights are used only to rank the local similarity values: each similarity value is multiplied by its weight, and the products are compared. The result of the similarity computation is the selected similarity value without its weight. Using simple measures, e.g. StringEquals, for the attributes, the similarity between the names is 1.0 and between the rooms 0.0. As 0.0 is the lowest similarity, this value would be returned.

K-Minimum #

The k-th lowest local similarity value determines the global similarity. Hence, if k out of n attributes indicate a low utility, the overall utility for the whole case is low.

Similarity: \(\begin{array}{l} \Phi(s_1,\ldots,s_n)=\omega_{i_k} \cdot s_{i_k} \\ \quad \text{with}\, \omega_{i_r}\cdot s_{i_r} \leq \omega_{i_{r+1}}\cdot s_{i_{r+1}} \end{array}\)

The following parameters can be set for this similarity measure.

| Parameter | Type/Range | Default Value | Description |
| --- | --- | --- | --- |
| k | Attribute number (int) | 1 | Defines which local similarity is returned: the k-th lowest one. Only values greater than 0 are allowed. If k is set to a value higher than the number of attributes, an exception is thrown. |

For example, an Aggregate K Minimum measure can look like:

<AggregateKMinimum name="AggregateKMinimumEmployees" class="Employee" k="2" default="false" defaultWeight="0.5">
   <AggWeight att="room" weight="0.3"/>
</AggregateKMinimum>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). No weight is given for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3. The value for k is set to 2, so the second lowest similarity will be returned. Because Employee has only two attributes, this is the local similarity of the higher-ranked attribute.

To create this measure during runtime, the following code would be used:

SMAggregateKMinimumImpl aggregateKMinimumEmployee = (SMAggregateKMinimumImpl) simVal.getSimilarityModel().createSimilarityMeasure(SMAggregateKMinimum.NAME, ModelFactory.getDefaultModel().getAggregateSystemClass());
aggregateKMinimumEmployee.setDataClass(ModelFactory.getDefaultModel().getClass("Employee"));
aggregateKMinimumEmployee.setDefaultWeight(0.5);
aggregateKMinimumEmployee.setWeight("room", 0.3);
aggregateKMinimumEmployee.setK(2);
simVal.getSimilarityModel().addSimilarityMeasure(aggregateKMinimumEmployee, "AggregateKMinimumEmployees");

For example, consider the same two aggregate objects of the class Employee that were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. For this measure, the weights are used only to rank the local similarity values: each similarity value is multiplied by its weight, and the products are sorted. The result of the similarity computation is the selected similarity value without its weight. Using simple measures, e.g. StringEquals, for the attributes, the similarity between the names is 1.0 and between the rooms 0.0. As 1.0 is the second lowest similarity, this value would be returned.

As another example, assume two aggregate objects of any class containing five attributes. The following similarities were calculated: 1.0, 0.8, 0.6, 0.4 and 0.2. Using the k-value of 2, the second lowest value of 0.4 is returned. Using the k-value of 4, the fourth lowest value of 0.8 is returned. For a k-value of 1, the lowest value of 0.2 is returned, analogous to AggregateMinimum, and for k=5 the highest value of 1.0, corresponding to AggregateMaximum.
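A compact stream-based sketch of this selection is shown below. As before, equal weights are assumed for the five attributes, and the class `KMinimumSketch`, the method `aggregate`, and the use of `IllegalArgumentException` are illustrative assumptions rather than the library's API.

```java
import java.util.Comparator;
import java.util.stream.IntStream;

final class KMinimumSketch {

  /** Returns the unweighted similarity that ranks k-th when sorted by weight * similarity, lowest first. */
  static double aggregate(double[] weights, double[] sims, int k) {
    if (weights.length != sims.length) {
      throw new IllegalArgumentException("weights and sims must have equal length");
    }
    if (k < 1 || k > sims.length) {
      throw new IllegalArgumentException("k must be between 1 and the number of attributes");
    }
    int index = IntStream.range(0, sims.length)
        .boxed()
        // Ascending order of weighted similarity.
        .sorted(Comparator.comparingDouble((Integer i) -> weights[i] * sims[i]))
        .skip(k - 1)
        .findFirst()
        .get();
    return sims[index];
  }

  public static void main(String[] args) {
    double[] weights = {0.2, 0.2, 0.2, 0.2, 0.2}; // equal weights (assumption)
    double[] sims = {1.0, 0.8, 0.6, 0.4, 0.2};

    System.out.println(aggregate(weights, sims, 1)); // prints 0.2, same as the minimum
    System.out.println(aggregate(weights, sims, 2)); // prints 0.4
    System.out.println(aggregate(weights, sims, 4)); // prints 0.8
    System.out.println(aggregate(weights, sims, 5)); // prints 1.0, same as the maximum
  }
}
```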

Minkowski #

The Minkowski aggregation is a generalization of the weighted average. The higher the value of the parameter p ≥ 1, the stronger the influence of the attribute with the highest local similarity. As p approaches infinity, this aggregation function becomes the maximum aggregation. If no weights are given, the similarity will be \(1.0\).

Similarity: \(\Phi(s_1,\ldots,s_n)=(\sum_{i=1}^n \omega_i \cdot s_i^p)^{1/p}\)
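To illustrate how p shifts the result from the weighted average towards the maximum, the formula as written above can be evaluated directly. This is a sketch of the formula only and makes no claim about the library's internal computation; `MinkowskiAggregationSketch`, `aggregate`, and the sample values are hypothetical.

```java
final class MinkowskiAggregationSketch {

  /** Evaluates (sum_i w_i * s_i^p)^(1/p), with the raw weights normalized to sum to 1. */
  static double aggregate(double[] rawWeights, double[] sims, double p) {
    if (rawWeights.length != sims.length) {
      throw new IllegalArgumentException("weights and sims must have equal length");
    }
    if (p < 1.0) {
      throw new IllegalArgumentException("p must be at least 1");
    }
    double weightSum = 0.0;
    for (double w : rawWeights) {
      weightSum += w;
    }
    double sum = 0.0;
    for (int i = 0; i < sims.length; i++) {
      sum += (rawWeights[i] / weightSum) * Math.pow(sims[i], p);
    }
    return Math.pow(sum, 1.0 / p);
  }

  public static void main(String[] args) {
    double[] weights = {0.5, 0.5};
    double[] sims = {0.9, 0.5};

    System.out.println(aggregate(weights, sims, 1));   // weighted average, about 0.7
    System.out.println(aggregate(weights, sims, 2));   // Euclidean case, between 0.7 and 0.9
    System.out.println(aggregate(weights, sims, 100)); // close to the maximum similarity 0.9
  }
}
```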

The following parameters can be set for this similarity measure.

| Parameter | Type/Range | Default Value | Description |
| --- | --- | --- | --- |
| p | Weight (double) | 2.0 | The parameter is used to define the influence of the attribute with the highest local similarity. |

For example, an Aggregate Minkowski measure can look like:

<AggregateMinkowski name="AggregateMinkowskiEmployee" class="Employee" p="3" defaultWeight="0.5">
   <AggWeight att="room" weight="0.3"/>
</AggregateMinkowski>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). There is no given weight for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3. The value for p is set to 3.

To create this measure during runtime, the following code would be used:

SMAggregateMinkowskiImpl aggregateMinkowskiEmployee = (SMAggregateMinkowskiImpl) simVal.getSimilarityModel().createSimilarityMeasure(SMAggregateMinkowski.NAME, ModelFactory.getDefaultModel().getAggregateSystemClass());
aggregateMinkowskiEmployee.setDataClass(ModelFactory.getDefaultModel().getClass("Employee"));
aggregateMinkowskiEmployee.setMinkowskiP(3);
aggregateMinkowskiEmployee.setDefaultWeight(0.5);
aggregateMinkowskiEmployee.setWeight("room", 0.3);
simVal.getSimilarityModel().addSimilarityMeasure(aggregateMinkowskiEmployee, "AggregateMinkowskiEmployee");

For example, there are the same two aggregate objects of the class Employee, which were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. Using simple measures for the attributes, the similarity between the names is 1.0, between the rooms 0.0.

For the total similarity, applying the formula above gives: \((0.625 \cdot 1.0^3 + 0.375 \cdot 0.0^3)^{\frac{1}{3}} = (0.625 + 0.0)^{\frac{1}{3}} \approx 0.855\).

Euclidian #

The Euclidean aggregation is the same as the Minkowski aggregation with a fixed p = 2. If no weights are given, the similarity will also be \(1.0\).

Similarity: \(\Phi(s_1,\ldots,s_n)=(\sum_{i=1}^n \omega_i \cdot s_i^2)^{1/2}\)

For example, an Aggregate Euclidean measure can look like:

<AggregateEuclidian name="AggregateEuclidianEmployee" class="Employee" defaultWeight="0.5">
   <AggWeight att="room" weight="0.3"/>
</AggregateEuclidian>

This measure would refer to the aggregate class Employee, which contains two attributes: name (String) and room (Integer). No weight is given for name, so it uses the default weight of 0.5, while the weight for room is set to 0.3.

To create this measure during runtime, the following code would be used:

SMAggregateEuclidianImpl aggregateEuclidianEmployee = (SMAggregateEuclidianImpl) simVal.getSimilarityModel().createSimilarityMeasure(SMAggregateEuclidian.NAME, ModelFactory.getDefaultModel().getAggregateSystemClass());
aggregateEuclidianEmployee.setDataClass(ModelFactory.getDefaultModel().getClass("Employee"));
aggregateEuclidianEmployee.setDefaultWeight(0.5);
aggregateEuclidianEmployee.setWeight("room", 0.3);
simVal.getSimilarityModel().addSimilarityMeasure(aggregateEuclidianEmployee, "AggregateEuclidianEmployee");

For example, there are the same two aggregate objects of the class Employee, which were used before. The first one has the values name=“TestEmployee” and room=123. The second one has the values name=“TestEmployee” and room=321. The normalized weights are 0.625 for name and 0.375 for room. Using simple measures for the attributes, the similarity between the names is 1.0, between the rooms 0.0.

For the total similarity, applying the formula above gives: \((0.625 \cdot 1.0^2 + 0.375 \cdot 0.0^2)^{\frac{1}{2}} = (0.625 + 0.0)^{\frac{1}{2}} \approx 0.7906\).