Package de.uni_trier.wi2.procake.similarity
Data similarity package description
Similarity Model
A similarity model is defined on top of the data Model
. Its
goal is to describe the similarity between objects based on the common DataClass
. This similarity description is called similiarity
measure and is used in CBR based retrieval engines. The SimilarityModel
Similarity Measures
The SimilarityModel
already contains about 30
SimilarityMeasures like Hamming Distance, Simple
Matching Coefficient, or sigmoid function. More information about the SimilarityMeasures available
is provided by Bergmann [Ber02].
A SimilarityMeasure is a mathematical function that calculates the Similarity between two DataObjects. In order to compare two DataObjects, a SimilarityMeasure for the common DataClass of both DataObjectsis used. Thus, the following enumeration is structured according to the DataClasses. It only gives an overview of the available SimilarityMeasures; the exact definition of each SimilarityMeasure can be found in the JavaDoc Documentation.
AggregateClass
- The Similarity between two AggregateObjects is calculated by Aggregation
Functions, that calculate the global similarity by combining the local similarity
values of the Aggregates' attributes. Currently, eight concrete
SimilarityMeasures are implemented and grouped by the help of the
AbstractSimilarityMeasure
SMAggregate
that determines the characteristics of all measures within the group. The concrete measures are:SMAggregateAverage
,SMAggregateEuclidian
,SMAggregateWeighted
,SMAggregateMinkowski
,SMAggregateMaximum
,SMAggregateMinimum
,SMAggregateKMaximum
, andSMAggregateKMinimum
. -
AtomicClass
- The similarity calculation between two AtomicObjects can be determined by
several SimilarityMeasures. Currently, 13 concrete SimilarityMeasures are
implemented and grouped by the help of three AbstractSimilarityMeasures that
determine the characteristics of all measures within the group. The table
summarizes them and specifies the predicates that have to be fulfilled in order to use
the SimilarityMeasure. These predicates mostly concern the order that must be
available for the instances of the concerned DataClass. The used predicates are
specified by the following abbreviations and Sympols:
- NO
- This means "Natural Order". Chronological and Numeric DataClasses already have a predefined, total order, so that the user does not need to model an own order for those DataClasses (although he or she may do so).
- IIP
- This means "Instance Interval Predicate". For user-defined chronologic and numeric DataClasses ( UserClasses that are sub-classes of ChronologicClass or NumericClass) the interval of possible attribute values may be taken into account during similarity assesment. For example, the NumericObjects "10" and "89" may be very dissimilar, if the attribute represents an age (where possible values are 0-110), but they may be similar, if the attribute represents a price (where possible values are 1-10000).
- IEP
- This means "Instance Enumeration Predicate". This predicate can be specified for UserClasses and means, that the number of its instances (possible values) must be finite, so that it is possible to enumerate them all.
- IEP/TO
- This means "Instance Enumeration Predicate/Total Order". In that case, an Instance Enumeration Predicate must be specified and a total order must exist for the possible values.
- IEP/TXO
- This means "Instance Enumeration Predicate/Taxonomy Order". In that case, that an Instance Enumeration Predicate must be specified and a taxonomy order must exist for the possible values.
- -
- This means that the SimilarityMeasure is not applicable for the concerned DataClass.
- forAll
- This means that the SimilarityMeasure is applicable for the concerned DataClass without any constraints.
Abstract Concrete BooleanClass ByteArrayClass ChronologicClass NumericClass StringClass SMNumeric
SMNumericExponential
,SMNumericLinear
,SMNumericSigmoid
,SMNumericThreshold
IEP/TO - NO, IIP, IEP/TO NO, IIP, IEP/TO IEP/TO SMString
SMStringEqual
,SMStringRegexp
,SMStringTermCount
,SMStringWildcard
- - - - forAll SMTaxonomy
SMTaxonomyClassic
,SMTaxonomyClassicUserWeights
,SMTaxonomyNodeHeight
,SMTaxonomyPath
IEP/TXO - IEP/TXO IEP/TXO IEP/TXO SMTableDataObject
IEP - IEP IEP IEP -
CollectionClass
- The Similarity between two CollectionObjects (Sets or Lists) can be calculated by
two SimilarityMeasures, namely
SMCollectionIsolatedMapping
andSMCollectionMapping
. Also, there is one measure, which just works on lists, namelySMListMapping
. IntervalClass
- The Similarity between two IntervalObjects can be calculated by the
SimilarityMeasure
SMInterval
. DataClass
- A very general SimilarityMeasure called
SMTableDataClass
can be used in case that no other SimilarityMeasure fits to the DataObjects that should be compared.
Vague Knowledge
In situations where no value is assigned to an attribute it is necessary to distinguish different strategies to compute the Similarity. Three kinds of strategies can be distinguished. The chosen strategy must be stated in the domain-specific SimilarityModel.
- Optimistic Strategy
- In an optimistic strategy it is assumed that unknown values argue for similarity (Constant: OPTIMISTIC)
- Pessimistic Strategy
- In a pessimistic strategy it is assumed that unknown values argue against similarity (Constant: PESSIMISTIC).
- Average Strategy
- In an average strategy it is assumed that unknown values argue for similarity value of E. This strategy requires calculating an expectancy value E which is not always possible (Constant: AVERAGE).
Knowledge Types
For IntervalObjects one can distinguish the two knowledge types imprecise knowledge and precise knowledge. This is necessary, because intervals in the query as well as in the cases may have different semantics: The semantics in the query can be that the user specifies an interval, because the concrete value does not matter to him or her and every value within it would be acceptable (imprecise knowledge) or because he or she searches for a case, covering it completely (precise knowledge). The semantics in the case can be that either all values are supported (precise knowledge) or one or several are supported but it is not known which ones (imprecise knowledge). The choosen Knowledge Type must be stated in the domain-specific SimilarityModel.
- Precise Knowledge
- If the semantics of an interval is that all specified values are addressed and must be true simultaneously we speak of precise knowledge (Constant: PRECISE)
- Imprecise Knowledge
- If the semantics of an interval is that only a subset of values is addressed we speak of imprecise knowledge (Constant: IMPRECISE).
Collection Types
For CollectionObjects one can distinguish the three collection types case inclusion, query inclusion, and intersection. The choosen Collection Type must be stated in the domain-specific SimilarityModel. For the following definitions, assume that two CollectionObjects (one in the query and one in the case) have to be compared.
- Case Inclusion
- The Similarity is the relation between the number of similar elements in the query collection and case collection and the number of elements in the case collection (Constant: CASE_INCLUSION).
- Query Inclusion
- The Similarity is the relation between the number of similar elements in the query collection and case collection and the number of elements in the query collection (Constant: QUERY_INCLUSION).
- Intersection
- The Similarity is the relation between the number of similar elements in the query collection and case collection and the number of different elements of both collections (Constant: INTERSECTION).
Similarity and Utility
Many retrieval methods follow a simple text matching approach without using knowledge about the semantics behind. The consequences may be fatal. For example, if an available product matches the requirements of the requested product but is not proposed by the retrieval service, the opportunity for improving efficiency, quality, or time-to-market is lost. On the other hand, if a proposed product is not sufficient resources are wasted. CBR is able to minimise both potential failure classes but it strongly depends on the quality of the similarity computation.
The goal of the similarity measures is to approximate the utility. Utility describes how helpful a solution is for a new problem. Thereby the traditional binary perception (true / false) is replaced by a finer grading. The approximation through the similarity measures bases on the assumption that similar problems have similar solutions or in other words, the solution of a problem is also useful for similar problems.
The problem descriptions of previous experiences get successively compared to the new problem by using the similarity function. The resulting similarity values are used to rank the cases in the case-base and the most similar cases are used to adapt their solutions to a new solution for the new problem. Consequently, a similarity measure sim is a function that expresses the similarity between two problem descriptions as a numerical value. The higher the value, the higher is the similarity between the two cases.
Depending on the representation format several similarity measures can be defined. For the attribute value representation the local-global principle is a well established technique. Thereby, the similarity calculation is performed by combining similarity values for each single attribute of the attribute space.
The local similarity values are used by the global similarity measure to aggregate them into one similarity value for all attributes. Thereby, the importance, relevance, and utility aspect of each attribute is encoded into the function.
A detailed description of the local-global principle and an overview of concrete similarity measures is presented by Bergmann [Ber02].
References
- Ber02
- Ralph Bergmann,Experience Management - Foundations, Development Methodology, and Internet-Based Applications, Lecture Notes in Artificial Intelligence 2432, Springer Berlin, Heidelberg, New York, Hong Kong, London, Milan, paris, Tokyo, 2002
-
Interface Summary Interface Description DependencySimilarity Interface for dependency specific similarities.LogComponentIdentifier Interface with constants required for the Logger.Similarity Represents a similarity value.SimilarityCache Main interface for all similarity cachesSimilarityMeasure Main interface for all similarity measures.SimilarityModel Data Similarity ModelSimilarityModelFactoryObject The factory object forSimilarityModelFactory
SimilarityValuator The similarity valuator computes the similarity between two objects. -
Class Summary Class Description SimilarityModelFactory This Factory class provides the implementation to theSimilarityModel
interfaces.