Package de.uni_trier.wi2.procake.similarity

Data similarity package description

Similarity Model

A similarity model is defined on top of the data Model. Its goal is to describe the similarity between objects based on the common DataClass. This similarity description is called similiarity measure and is used in CBR based retrieval engines. The SimilarityModel

Similarity Measures

The SimilarityModel already contains about 30 SimilarityMeasures like Hamming Distance, Simple Matching Coefficient, or sigmoid function. More information about the SimilarityMeasures available is provided by Bergmann [Ber02].

A SimilarityMeasure is a mathematical function that calculates the Similarity between two DataObjects. In order to compare two DataObjects, a SimilarityMeasure for the common DataClass of both DataObjectsis used. Thus, the following enumeration is structured according to the DataClasses. It only gives an overview of the available SimilarityMeasures; the exact definition of each SimilarityMeasure can be found in the JavaDoc Documentation.

AggregateClass
The Similarity between two AggregateObjects is calculated by Aggregation Functions, that calculate the global similarity by combining the local similarity values of the Aggregates' attributes. Currently, eight concrete SimilarityMeasures are implemented and grouped by the help of the AbstractSimilarityMeasure SMAggregate that determines the characteristics of all measures within the group. The concrete measures are: SMAggregateAverage, SMAggregateEuclidian, SMAggregateWeighted, SMAggregateMinkowski, SMAggregateMaximum, SMAggregateMinimum, SMAggregateKMaximum, and SMAggregateKMinimum.
AtomicClass
The similarity calculation between two AtomicObjects can be determined by several SimilarityMeasures. Currently, 13 concrete SimilarityMeasures are implemented and grouped by the help of three AbstractSimilarityMeasures that determine the characteristics of all measures within the group. The table summarizes them and specifies the predicates that have to be fulfilled in order to use the SimilarityMeasure. These predicates mostly concern the order that must be available for the instances of the concerned DataClass. The used predicates are specified by the following abbreviations and Sympols:
NO
This means "Natural Order". Chronological and Numeric DataClasses already have a predefined, total order, so that the user does not need to model an own order for those DataClasses (although he or she may do so).
IIP
This means "Instance Interval Predicate". For user-defined chronologic and numeric DataClasses ( UserClasses that are sub-classes of ChronologicClass or NumericClass) the interval of possible attribute values may be taken into account during similarity assesment. For example, the NumericObjects "10" and "89" may be very dissimilar, if the attribute represents an age (where possible values are 0-110), but they may be similar, if the attribute represents a price (where possible values are 1-10000).
IEP
This means "Instance Enumeration Predicate". This predicate can be specified for UserClasses and means, that the number of its instances (possible values) must be finite, so that it is possible to enumerate them all.
IEP/TO
This means "Instance Enumeration Predicate/Total Order". In that case, an Instance Enumeration Predicate must be specified and a total order must exist for the possible values.
IEP/TXO
This means "Instance Enumeration Predicate/Taxonomy Order". In that case, that an Instance Enumeration Predicate must be specified and a taxonomy order must exist for the possible values.
-
This means that the SimilarityMeasure is not applicable for the concerned DataClass.
forAll
This means that the SimilarityMeasure is applicable for the concerned DataClass without any constraints.
Abstract Concrete BooleanClass ByteArrayClass ChronologicClass NumericClass StringClass
SMNumeric SMNumericExponential, SMNumericLinear, SMNumericSigmoid, SMNumericThreshold IEP/TO - NO, IIP, IEP/TO NO, IIP, IEP/TO IEP/TO
SMString SMStringEqual, SMStringRegexp, SMStringTermCount, SMStringWildcard - - - - forAll
SMTaxonomy SMTaxonomyClassic, SMTaxonomyClassicUserWeights, SMTaxonomyNodeHeight, SMTaxonomyPath IEP/TXO - IEP/TXO IEP/TXO IEP/TXO
SMTableDataObject IEP - IEP IEP IEP
CollectionClass
The Similarity between two CollectionObjects (Sets or Lists) can be calculated by two SimilarityMeasures, namely SMCollectionIsolatedMapping and SMCollectionMapping. Also, there is one measure, which just works on lists, namely SMListMapping.
IntervalClass
The Similarity between two IntervalObjects can be calculated by the SimilarityMeasure SMInterval.
DataClass
A very general SimilarityMeasure called SMTableDataClass can be used in case that no other SimilarityMeasure fits to the DataObjects that should be compared.

Vague Knowledge

In situations where no value is assigned to an attribute it is necessary to distinguish different strategies to compute the Similarity. Three kinds of strategies can be distinguished. The chosen strategy must be stated in the domain-specific SimilarityModel.

Optimistic Strategy
In an optimistic strategy it is assumed that unknown values argue for similarity (Constant: OPTIMISTIC)
Pessimistic Strategy
In a pessimistic strategy it is assumed that unknown values argue against similarity (Constant: PESSIMISTIC).
Average Strategy
In an average strategy it is assumed that unknown values argue for similarity value of E. This strategy requires calculating an expectancy value E which is not always possible (Constant: AVERAGE).

Knowledge Types

For IntervalObjects one can distinguish the two knowledge types imprecise knowledge and precise knowledge. This is necessary, because intervals in the query as well as in the cases may have different semantics: The semantics in the query can be that the user specifies an interval, because the concrete value does not matter to him or her and every value within it would be acceptable (imprecise knowledge) or because he or she searches for a case, covering it completely (precise knowledge). The semantics in the case can be that either all values are supported (precise knowledge) or one or several are supported but it is not known which ones (imprecise knowledge). The choosen Knowledge Type must be stated in the domain-specific SimilarityModel.

Precise Knowledge
If the semantics of an interval is that all specified values are addressed and must be true simultaneously we speak of precise knowledge (Constant: PRECISE)
Imprecise Knowledge
If the semantics of an interval is that only a subset of values is addressed we speak of imprecise knowledge (Constant: IMPRECISE).

Collection Types

For CollectionObjects one can distinguish the three collection types case inclusion, query inclusion, and intersection. The choosen Collection Type must be stated in the domain-specific SimilarityModel. For the following definitions, assume that two CollectionObjects (one in the query and one in the case) have to be compared.

Case Inclusion
The Similarity is the relation between the number of similar elements in the query collection and case collection and the number of elements in the case collection (Constant: CASE_INCLUSION).
Query Inclusion
The Similarity is the relation between the number of similar elements in the query collection and case collection and the number of elements in the query collection (Constant: QUERY_INCLUSION).
Intersection
The Similarity is the relation between the number of similar elements in the query collection and case collection and the number of different elements of both collections (Constant: INTERSECTION).

Similarity and Utility

Many retrieval methods follow a simple text matching approach without using knowledge about the semantics behind. The consequences may be fatal. For example, if an available product matches the requirements of the requested product but is not proposed by the retrieval service, the opportunity for improving efficiency, quality, or time-to-market is lost. On the other hand, if a proposed product is not sufficient resources are wasted. CBR is able to minimise both potential failure classes but it strongly depends on the quality of the similarity computation.

The goal of the similarity measures is to approximate the utility. Utility describes how helpful a solution is for a new problem. Thereby the traditional binary perception (true / false) is replaced by a finer grading. The approximation through the similarity measures bases on the assumption that similar problems have similar solutions or in other words, the solution of a problem is also useful for similar problems.

The problem descriptions of previous experiences get successively compared to the new problem by using the similarity function. The resulting similarity values are used to rank the cases in the case-base and the most similar cases are used to adapt their solutions to a new solution for the new problem. Consequently, a similarity measure sim is a function that expresses the similarity between two problem descriptions as a numerical value. The higher the value, the higher is the similarity between the two cases.

Depending on the representation format several similarity measures can be defined. For the attribute value representation the local-global principle is a well established technique. Thereby, the similarity calculation is performed by combining similarity values for each single attribute of the attribute space.

The local similarity values are used by the global similarity measure to aggregate them into one similarity value for all attributes. Thereby, the importance, relevance, and utility aspect of each attribute is encoded into the function.

A detailed description of the local-global principle and an overview of concrete similarity measures is presented by Bergmann [Ber02].

References

Ber02
Ralph Bergmann,Experience Management - Foundations, Development Methodology, and Internet-Based Applications, Lecture Notes in Artificial Intelligence 2432, Springer Berlin, Heidelberg, New York, Hong Kong, London, Milan, paris, Tokyo, 2002