Retriever Evaluation #

In ProCAKE, the class RetrieverEvaluation is provided, which can be used to evaluate different retrievers or differently configured retrievers. It is mostly used to evaluate whether a proposed MAC/FAC retrieval approach improves the retrieval time without significantly reducing the retrieval quality.

Throughout the evaluation, different metrics can be used to compare the retrieval results w.r.t. retrieval quality and performance. The performance is measured as the retrieval time in seconds. The quality is evaluated by comparing the results of a retriever to the ground-truth retrieval results in terms of several quality measures that are explained in the remainder of this page.

For all evaluations, a test and a train case base are used. The test case base is the case base from which all queries are taken. It is usually much smaller than the train case base. The train case base serves as the case base to retrieve from. Thus, all similarities computed during a retrieval are determined between a query from the test case base and a case from the train case base.

Quality Metrics #

In the following, the applicable quality metrics and their usage in ProCAKE are described.

Every metric provides the following methods, inherited from the abstract class EvalMetricComputer, that can be used for the computation:

  1. getMetricName(): Returns the name of the metric as a string.
  2. computeRankingMetric(SimpleSimilarityResult groundTruthRankingResult, SimpleSimilarityResult predictedRankingResult): Computes the metric for two retrieval results. The first is the ground-truth retrieval result, the second is the retrieval result to evaluate, i.e., the result that stems from the retriever to evaluate. A value in the interval \([0,1]\) is returned. A corresponding class exists for every metric described below.

For some metrics, a parameter k can be set that specifies the number of retrieval results to include in the computation. It is only relevant for metrics that focus on the top-k results.
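
As a minimal sketch, a single metric value can be computed directly with these methods. The variables groundTruthResult and predictedResult are assumed here; they stand for two SimpleSimilarityResult instances for the same query, e.g., one taken from the ground-truth similarities and one produced by the retriever to evaluate:

    // Hedged sketch: MAEMetric is one of the metric classes described below.
    EvalMetricComputer metricComputer = new MAEMetric();
    // Compare the ground-truth result with the result of the retriever to evaluate.
    double metricValue = metricComputer.computeRankingMetric(groundTruthResult, predictedResult);
    System.out.println(metricComputer.getMetricName() + ": " + metricValue);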

For the description of the metrics, the following example is used. One ground-truth result list and one result list to evaluate are given. Both use the graph G1 as query and contain the similarities to the graphs G2, G3, and G4. We assume the following similarities:

| sim | ground-truth results | results to evaluate |
| --- | --- | --- |
| sim(G1,G2) | 0.5 | 0.5 |
| sim(G1,G3) | 0.3 | 0.2 |
| sim(G1,G4) | 0.2 | 0.3 |

Please note that, according to the similarity values, the nearest neighbors of the ground-truth results differ from the nearest neighbors of the results to evaluate. Nevertheless, the similarity values are very close to each other.

Hits #

The Hits metric counts the cases that are contained in both the ground-truth results and the results to evaluate. For every object that is represented in both lists, the hits counter is increased by 1. If the hits value is equal to the number of ground-truth results, both result lists contain the same elements. The similarity values are ignored in this metric. It is possible to set the value k, which is only used for the truncation of the retrieval result lists. This metric is the only one that is not normalized.

For the example, the Hits metric yields a value of 3.0 if k is 3 or no k is set. This is the highest possible score in our scenario, because all three elements are present in both lists. When using 2 as the value for k, the metric value is 1.0, because only one matching partner is found among the 2 cases with the highest similarity in both result lists.

To get a normalized value of this metric, the metric Hits normalized exists. For this, the value of the Hits metric is divided either by k (if set) or by the length of the result list. For the example, the metric value is 1.0 without a value set for k and 0.5 when using 2 as the value for k.
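
Expressed as a formula following this description: \( hits_{norm} = \frac{hits}{k} \) if k is set, and \( hits_{norm} = \frac{hits}{|RL|} \) otherwise, where \(|RL|\) denotes the length of the result list.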

Mean Absolute Error (MAE) #

The Mean Absolute Error metric is the average absolute prediction error between the ground-truth results and the results to evaluate. Thereby, the similarity values of the graph pairs at the same positions in both result lists are compared. Its value is in the range of \(0\) to \(1\) , where \(1\) is the highest possible error and \(0\) means that no error is present.

The Mean Absolute Error is calculated with the following formula:

\(MAE = \frac{ 1 }{ |RL| } * \sum \limits_{(QW,CW) \in RL} { | sim(QW,CW) - fsim(QW,CW) | }\)

Here, \(QW\) and \(CW\) stand for the two graphs (query and case workflow). \(RL\) stands for the retrieval result list to evaluate, \(fsim\) for the predicted similarity between both graphs, and \(sim\) for the ground-truth similarity label.

For the example, the following computation results: \(MAE = \frac{ |0.5 - 0.5| + |0.3 - 0.3| + |0.2 - 0.2| }{ 3 } = \frac{0}{3} = 0.0\) . So, there is no Mean Absolute Error in this case.

For more details, the paper Using Siamese Graph Neural Networks for Similarity-Based Retrieval in Process-Oriented Case-Based Reasoning is recommended.

Mean Squared Error (MSE) #

The Mean Squared Error metric is the average squared prediction error between the ground-truth results and the results to evaluate. Its value is in the range of \(0\) to \(1\) , with higher values indicating decreased quality. It is generally very similar to the MAE metric.

The Mean Squared Error is calculated with the following formula:

\(MSE = \frac{ 1 }{ |RL| } * \sum \limits_{(QW,CW) \in RL} { (sim(QW,CW) - fsim(QW,CW))^2 }\)

\(RL\) stands for the retrieval result list to evaluate, \(fsim\) for the predicted similarity between both graphs, and \(sim\) for the ground-truth similarity label.

For our example, the following computation results: \(MSE = \frac{ 1 }{ 3 } * ((0.5 - 0.5)^2 + (0.3 - 0.3)^2 + (0.2 - 0.2)^2) = \frac{0}{3} = 0.0\)

Quality Stromer #

The Quality Stromer metric measures the quality of a MAC/FAC retrieval. Its value is in the range from \(0\) to \(1\) , where \(1\) is the maximum and \(0\) the minimum quality. It uses a function err(i), which is \(1\) if the case at retrieval position i of the ground-truth results also appears at this position in the results to evaluate. Otherwise, it is \(0\).

The metric is computed as follows:

\( quality_{stromer}(err,k) = \frac{ \sum_{ i=1 }^{ k }{ err(i) * (2 + k - i) } }{ \sum_{ i=1 }^{ k }{(2 + k - i) } }\)

If k is not set, the number of cases in the result list is used. The first elements in the result list are weighted higher than the later ones.

For the example, the following computation results: \( quality_{stromer}(err,k) = \frac{ err(1) * (2 + 3 - 1) + err(2) * (2 + 3 - 2) + err(3) * (2 + 3 - 3) }{ (2 + 3 - 1) + (2 + 3 - 2) + (2 + 3 - 3) } = \frac{ 1 * 4 + 0 * 3 + 0 * 2 }{ 4 + 3 + 2 } = \frac{4}{9} \approx 0.4444 \) . When using 2 as the value for k, the following is computed: \( quality_{stromer}(err,k) = \frac{ err(1) * (2 + 2 - 1) + err(2) * (2 + 2 - 2)}{ (2 + 2 - 1) + (2 + 2 - 2)} = \frac{ 1 * 3 + 0 * 2 }{ 3 + 2 } = \frac{3}{5} = 0.6 \)

For more details, reading the paper MAC/FAC Retrieval of Semantic Workflows is recommended.

Quality Mueller #

The Quality Mueller metric measures the quality of a retrieval. Its value is in the range from \(0\) to \(1\) , where \(1\) is the maximum and \(0\) the minimum quality. The metric determines this value by subtracting from \(1\) the ground-truth similarities of all retrieval results that are missing in the top-k cases. Additionally, the values are combined with a discount so that more relevant missing cases reduce the quality even more. If k is not set, the complete list is considered. Note that even if the Hits metric has a value of \(0\) in the same setup, this quality is not necessarily \(0\).

The Quality Mueller is computed with the following formula:

\( quality_{mueller}(QW,RL) = 1 - \frac{ 1}{ |RL| } * \sum \limits_{CW \in \{ MSC(QW,|RL|)\backslash RL \} }{sim(QW,CW)} \)

Here, \(MSC(QW,|RL|)\) stands for the ground-truth results and \(RL\) for the results of the retriever to evaluate. The ground-truth results are truncated to the \(|RL|\) most similar workflows (second parameter of MSC), thus containing the same number of workflows as RL. To measure the quality of RL, every workflow of the truncated ground-truth result MSC is looked up in RL; the ground-truth similarities of the workflows that are missing in RL reduce the quality.

For the example, the following term results: \( quality_{mueller}(QW,RL) = 1 - \frac{1}{3} * (0.0) = 1\) . That is because every result in the ground-truth-results has an associated partner in the results to evaluate.

When using 2 as the value for k, only the first two elements are considered. The only ground-truth case missing in the results to evaluate is G3 with a ground-truth similarity of 0.3, resulting in: \( quality_{mueller}(QW,RL) = 1 - \frac{1}{2} * (0.3) = \frac{17}{20} = 0.85\) . Here, the order of the ground-truth results and the results to evaluate is important, because a different order would lead to a different metric value.

For more details, reading the paper A Cluster-Based Approach to Improve Similarity-Based Retrieval for Process-Oriented Case-Based Reasoning is recommended.

Correctness #

The Correctness metric is the ratio of concordant and discordant pairs with respect to all pairs. Its value is in the range of \(-1\) to \(1\) . \(-1\) means that only discordant pairs exist and \(1\) means that only concordant pairs exist. A Correctness value of \(0\) is the least desirable value.

For this measure, new symbols are introduced. \(\sqsubset\) expresses which workflow is ranked before the other workflow in \(RL\). \(\sqsubset_{\ast}\) stands for the same information, but for the ranking that results from the ground-truth similarities.

A pair of workflows is concordant if it has the same order in the retrieval results to evaluate and in the ground-truth retrieval results. This means that one workflow is ranked either higher or lower than the other in both results. The set of concordant pairs can be computed as follows:

\( C = \{ (CW_{1}, CW_{2}) \in RL \ | \ (CW_{1} \sqsubset CW_{2} \land CW_{1} \sqsubset_{\ast} CW_{2}) \lor (CW_{2} \sqsubset CW_{1} \land CW_{2} \sqsubset_{\ast} CW_{1}) \} \)

A pair of workflows is discordant if the two workflows are ordered differently in the retrieval results to evaluate than they should be according to the ground-truth similarities. Thus, if a pair is not concordant, it is discordant (assuming there are no ties, which holds in our scenario). The set of discordant pairs is calculated with the following formula:

\( D = \{ (CW_{1}, CW_{2}) \in RL \ | \ (CW_{1} \sqsubset CW_{2} \land CW_{2} \sqsubset_{\ast} CW_{1}) \lor (CW_{2} \sqsubset CW_{1} \land CW_{1} \sqsubset_{\ast} CW_{2}) \} \)

Using the computed values of concordant and discordant pairs, the Correctness can be computed with the following formula:

\( correctness(\sqsubset,\sqsubset_{\ast}) = \frac{|C|-|D|}{|C|+|D|} \)

The metric thus combines the numbers of concordant and discordant pairs and relates their difference to the total number of pairs. The value k is only used for the truncation of the retrieval results.

For the example, two concordant pairs and one discordant pair exist. G2 is ranked higher than G3 and G4 in both result lists, so two concordant pairs are found. Because the comparison is only made in a downward direction, only one possible pair is left, containing G3 and G4. These two are ranked differently in the two lists, so this pair counts as discordant. The metric is therefore calculated as follows: \( correctness(\sqsubset,\sqsubset_{\ast}) = \frac{2-1}{2+1} = \frac{1}{3} \approx 0.3333 \) .

For more details, the paper Using Siamese Graph Neural Networks for Similarity-Based Retrieval in Process-Oriented Case-Based Reasoning is recommended.

Completeness #

The Completeness metric also relies on the definitions of concordant and discordant pairs from the Correctness metric. It covers the scenario that case workflows can have the same similarity, which does not necessarily lead to a completely defined ranking. Thus, the completeness decreases if there are ties in the result list. The value k is only used for the truncation of the retrieval result list.

The Completeness is computed by the following formula:

\( completeness(\sqsubset,\sqsubset_{\ast}) = \frac{|C|+|D|}{|\succ|} \)

Here, \(\succ\) stands for the set of comparisons that were made.

For the example, the calculation of the concordant and discordant pairs can be found above. The number of comparisons is 3, because there are 3 possible pairs that can be considered. So, the following metric value is computed: \( completeness(\sqsubset,\sqsubset_{\ast}) = \frac{2 + 1}{3} = 1.0 \) .
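
To illustrate the effect of ties: if, for instance, the results to evaluate assigned the same similarity to G3 and G4, this pair would be neither concordant nor discordant, and the completeness would drop to \( \frac{2}{3} \).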

Distance #

The Distance metric computes the average distance in ranks between the ground-truth ranking and the predicted ranking.

The Distance is computed by the following formula:

\( distance(QW,RL) = \frac{ 1 }{ |RL| } * \sum \limits_{(QW,CW) \in RL}{ | index_{sim}(QW,CW) - index_{fsim}(QW,CW) | } \)

Here, \(index_{sim}(QW,CW)\) denotes the position of the case \(CW\) in the ground-truth ranking and \(index_{fsim}(QW,CW)\) its position in the predicted ranking.

For the example, two cases have a rank difference of \(1\) and one case has a rank difference of \(0\). So, the following metric value is computed: \(\frac{1 + 1 + 0}{3} = \frac{2}{3} \approx 0.6666\)

Kendall #

The Kendall metric also relies on the definitions of concordant and discordant pairs from the Correctness metric.

The Kendall metric is computed by the following formula:

\(\tau ={\frac {({\text{number of concordant pairs}})-({\text{number of discordant pairs}})}{n \choose 2}}.\)

For the example, the calculation of the concordant and discordant pairs can be found above. So, the following metric value is computed: \( \tau = \frac {2-1}{3 \choose 2} = \frac{1}{3} \approx 0.3333\) .

Spearman #

The Spearman metric assesses how well the relationship between two variables can be described using a monotonic function.

The Spearman metric is computed by the following formula:

\(r_{s}=\rho _{\operatorname {rg} _{X},\operatorname {rg} _{Y}}={\frac {\operatorname {cov} (\operatorname {rg} _{X},\operatorname {rg} _{Y})}{\sigma _{\operatorname {rg} _{X}}\sigma _{\operatorname {rg} _{Y}}}}\)

Here, \(\rho\) denotes the usual Pearson correlation coefficient, applied to the rank variables (see here). \(\operatorname {cov} (\operatorname {rg} _{X},\operatorname {rg} _{Y})\) is the covariance of the rank variables, and \(\sigma _{\operatorname {rg} _{X}}\) and \(\sigma _{\operatorname {rg} _{Y}}\) are the standard deviations of the rank variables.

More information about the calculation can be found here. For the example, a Spearman metric of \(0.5\) would be computed.
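
Since all ranks in the example are distinct, the coefficient can equivalently be computed with the simplified formula \( r_{s} = 1 - \frac{6 \sum d_{i}^{2}}{n(n^{2}-1)} \), where \(d_{i}\) is the rank difference of case \(i\). The rank differences are \(0\), \(1\), and \(1\), so \( r_{s} = 1 - \frac{6 \cdot 2}{3 \cdot (3^{2}-1)} = 1 - \frac{12}{24} = 0.5 \).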

Compilation of metrics used on example #

In the following table, the calculated metrics for the example ground-truth results and results to evaluate (containing G2, G3, and G4) are set against each other, once without a value for k and once with k set to 2.

| metric | results | results (k=2) |
| --- | --- | --- |
| \(hits\) | 3.0 | 1.0 |
| \(hits_{norm}\) | 1.0 | 0.5 |
| \(MAE\) | 0.0 | |
| \(MSE\) | 0.0 | |
| \(quality_{stromer}\) | 0.4444 | 0.6 |
| \(quality_{mueller}\) | 1.0 | 0.85 |
| \(correctness\) | 0.3333 | - |
| \(completeness\) | 1.0 | - |
| \(distance\) | 0.6666 | - |
| \(kendall\) | 0.3333 | - |
| \(spearman\) | 0.5 | - |

Usage #

Retriever Evaluation #

Before using the RetrieverEvaluation, ensure that ProCAKE is started. An object of the class can be created by using an empty constructor.

Wiki_RetrieverEvaluationTest.java

    RetrieverEvaluation retrieverEvaluation = new RetrieverEvaluation();

Before running the evaluation, the train and test case base have to be set, as well as the retrievers and metrics to evaluate. Train and test case base must be instances of ReadableObjectPool. How to create and use such pools can be read here. The test cases are used as queries, the train cases are used as the case base to retrieve from. If only one case base is desired, the test case base and the train case base can simply be set to the same object. The values can be set as follows:

Wiki_RetrieverEvaluationTest.java

    retrieverEvaluation.setTrainTestCaseBase(trainCaseBase, testCaseBase);
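
If only one case base is used, the same pool can simply be passed twice. A minimal sketch, assuming a pool named caseBase:

    // Hedged sketch: use the same pool as train and test case base.
    retrieverEvaluation.setTrainTestCaseBase(caseBase, caseBase);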

Alternatively, the case bases can also be passed in the constructor, when creating the evaluation:

Wiki_RetrieverEvaluationTest.java

    new RetrieverEvaluation(trainCaseBase, testCaseBase);

Also, instances of the retrievers to evaluate have to be created. How to do this can be read here. They can be set by using the method addRetrieverToEvaluate(String uniqueRetrieverName, Retriever<NESTGraphObject, Query> retriever). It is important to ensure that the string uniqueRetrieverName is unique, so that this retriever can be accessed and referenced by this name.

The call of this method can look like this:

Wiki_RetrieverEvaluationTest.java

    retrieverEvaluation.addRetrieverToEvaluate("Example retriever", linearRetriever);

For the metrics, instances have to be created, too. Every metric described above can be used and added by using the method addMetricToEvaluate(EvalMetricComputer metric). For instance, this can look like:

Wiki_RetrieverEvaluationTest.java

    retrieverEvaluation.addMetricToEvaluate(new MSEMetric());
    retrieverEvaluation.addMetricToEvaluate(new MAEMetric());
    retrieverEvaluation.addMetricToEvaluate(new CompletenessMetric());

Afterwards, the static method computeGroundTruthSimilarities(Retriever<NESTWorkflowObject, Query> groundTruthRetriever, ReadableObjectPool<NESTWorkflowObject> trainCaseBase, ReadableObjectPool<NESTWorkflowObject> testCaseBase, String pathGroundTruthSimilarities) of the class RetrieverEvaluationUtils has to be called. Here, a retriever must be given that is used to compute the ground-truth similarities. The string pathGroundTruthSimilarities specifies the path where the computed similarities are stored as a CSV file. If it is null, no file is stored. The returned list contains the retrieval result for every query graph. The computed ground-truth similarities are then passed to the evaluation by using the method setGroundTruthSimilarities(List<SimpleSimilarityResult> gtSimilarities).

The call of this method can look like this:

Wiki_RetrieverEvaluationTest.java

    try {
      List<SimpleSimilarityResult> groundTruthSimilarities = RetrieverEvaluationUtils.computeGroundTruthSimilarities(linearRetriever, trainCaseBase, testCaseBase, null);
      retrieverEvaluation.setGroundTruthSimilarities(groundTruthSimilarities);
    } catch (IOException e) {
      e.printStackTrace();
    }
It is important to surround this method with a try-catch statement, because it can throw an IOException. The exception can occur while opening and writing the file that stores the ground-truth similarities. Here, a linear retriever is used as the ground-truth retriever, which has to be created earlier in the code.

To perform the evaluation, the method performEvaluation() has to be used. It returns the evaluation results as a map with a RetrieverMetricKeyPair, consisting of a retriever name and a metric, as key and the metric value as double. Every retriever computes its similarities and keeps track of the retrieval time. The results are used for the computation of all metrics, which run in parallel. Afterwards, the average values of all metrics are calculated. This can look like:

Wiki_RetrieverEvaluationTest.java

    try {
      Map<RetrieverMetricKeyPair, Double> results = retrieverEvaluation.performEvaluation();
    } catch (RetrieverEvaluationException e) {
      e.printStackTrace();
    }
Again, it is important to catch a RetrieverEvaluationException that can occur during the evaluation for various reasons, for example, if the retrieval results of a retriever do not contain all cases from the case base or if the file with the ground-truth similarities cannot be opened.
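
As a minimal sketch, the returned map can be inspected by iterating over its entries. This assumes the results map from the listing above is in scope; the exact string representation of RetrieverMetricKeyPair depends on ProCAKE:

    // Print the average metric value for every retriever/metric combination.
    for (Map.Entry<RetrieverMetricKeyPair, Double> entry : results.entrySet()) {
      System.out.println(entry.getKey() + " -> " + entry.getValue());
    }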

The metric results can be written to a file using the writeMetricResultsAsCSV method. This method exists with two different signatures: one taking an OutputStream and one taking the export path as a String. As before, if the path is null, no output file is created. In that case, it is important to cast null to String, because the OutputStream variant cannot handle null objects. Otherwise, the results are serialized to this path as a CSV file. The values computed before are written to the console and, if a path is set, to the CSV file. This can look like:

Wiki_RetrieverEvaluationTest.java

    try {
      retrieverEvaluation.writeMetricResultsAsCSV(PATH_WORKING_DIRECTORY + "evalResults.csv");
    } catch (IOException | RetrieverEvaluationException e) {
      fail(e);
    }
Again, it is important to catch an IOException as well as a RetrieverEvaluationException, both of which can occur during the export for various reasons.

Alternatively, it is also possible to use the getMetricResultsAsCSVString() method to obtain the results only as a string. In this case, the results are also output to the console. This can look like this:

Wiki_RetrieverEvaluationTest.java

    String resultsAsString = retrieverEvaluation.getMetricResultsAsCSVString();
A runnable test class can be found as RetrieverEvaluationTest.

MAC/FAC Retriever Evaluation #

There is a class MACFACRetrieverEvaluation which inherits from the standard class RetrieverEvaluation. The train and test case base are set as with the standard class, as are the retrievers, the metrics, and the ground-truth similarities. It is created analogously:

Wiki_RetrieverEvaluationTest.java

    MACFACRetrieverEvaluation macfacRetrieverEvaluation = new MACFACRetrieverEvaluation();

Additionally, a FAC retriever has to be set. This can look like this:

Wiki_RetrieverEvaluationTest.java

    macfacRetrieverEvaluation.setFacGTRetriever(linearRetriever2);

Additionally, filter sizes and k-values can be set. The filter sizes determine the number of results of the MAC phase; the k-values determine the number of final results. Both are set as arrays, which must have the same size. The values at the same index form a pair that is evaluated together. Setting the values can look like this:

Wiki_RetrieverEvaluationTest.java

    macfacRetrieverEvaluation.setFilterSizes(new int[]{1, 2});
    macfacRetrieverEvaluation.setKs(new int[]{1, 2});

The call to the performEvaluation() method is the same as for the standard class. The returned map can be cast to Map<RetrieverFSKMetricKeyPair, Double>, since the key here contains the values for the filter size and k in addition to the retriever and the metric. The output as string or CSV is also analogous.
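
A minimal sketch of this call, assuming the evaluation has been configured as shown above; the unchecked cast mirrors the cast described in the previous paragraph:

    try {
      @SuppressWarnings("unchecked")
      Map<RetrieverFSKMetricKeyPair, Double> macfacResults =
          (Map<RetrieverFSKMetricKeyPair, Double>) (Map<?, ?>) macfacRetrieverEvaluation.performEvaluation();
      // The keys now additionally carry the filter size and k of each evaluated configuration.
      System.out.println(macfacResults);
    } catch (RetrieverEvaluationException e) {
      e.printStackTrace();
    }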