net.jxta.search.relevance
Class TFIDFRelevance

java.lang.Object
  |
  +--net.jxta.search.relevance.TFIDFRelevance

public class TFIDFRelevance
extends java.lang.Object

Used to compute relevance of search results for opensearch.


Constructor Summary
TFIDFRelevance()
           
TFIDFRelevance(java.lang.String count_type)
           
 
Method Summary
static void clearDebug()
           
static float[] compute_idf(int num_docs, int[] DocCount)
          Compute inverse document frequency.
static float[][] compute_tf_log(int[][] TermCount)
          Compute term frequency using log method.
static float[][] compute_tf_max(int[][] TermCount)
          Compute term frequency using the max counts of term per doc.
static float[] computeTFIDFSimilarity(float[][] TF, float[] IDF, int[] WordCount)
          Simple TFIDF weighting is used for a similarity measure.
static float[] findRel(java.lang.String queryString, java.util.ArrayList docs)
          findRel takes a String "queryString" and an ArrayList "docs".
static java.lang.String getTermCountAlgorithm()
           
static void main(java.lang.String[] argv)
           
static java.lang.String[] makeQueryArray(java.lang.String queryString)
           
static void setDebug()
           
static void setStem()
           
static void setTermCountBM()
           
static void setTermCountNaive()
           
static boolean test(java.io.PrintStream ps)
           
static void unsetStem()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TFIDFRelevance

public TFIDFRelevance()

TFIDFRelevance

public TFIDFRelevance(java.lang.String count_type)
Method Detail

setDebug

public static void setDebug()

clearDebug

public static void clearDebug()

setStem

public static void setStem()

unsetStem

public static void unsetStem()

setTermCountNaive

public static void setTermCountNaive()

setTermCountBM

public static void setTermCountBM()

getTermCountAlgorithm

public static java.lang.String getTermCountAlgorithm()

compute_idf

public static float[] compute_idf(int num_docs,
                                  int[] DocCount)
Compute inverse document frequency. See Baeza-Yates, p. 29, Eq. 2.3, where we are using the weights directly as the similarity value. FIXME: This method uses sqrt, but Baeza-Yates uses log. Why?
Parameters:
int - num_docs total number of documents, exclusive of the query; must be >= 0.
int - [] DocCount Each index i of this array holds the number of times the ith query term appears in a document. Each entry must be >= 0.
Returns:
float [] IDF Inverse document frequency of the ith query term with respect to all the documents, computed using sqrt instead of log. NOTE: No range checking is done in this method, and the behavior with invalid parameters is unknown.
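
The exact formula is not given above, so the following is only a minimal sketch for comparison. It assumes the textbook form idf_i = log(N / n_i) and a hypothetical sqrt variant idf_i = sqrt(N / n_i), and it treats DocCount[i] as the number of documents that contain term i. The class name IdfSketch, both method names, and the choice to return 0 when a count is 0 are assumptions of this sketch, not part of TFIDFRelevance.

```java
import java.util.Arrays;

public class IdfSketch {
    // Textbook form: idf_i = log(N / n_i).
    // Assumption of this sketch: a term that appears in no document gets weight 0.
    public static float[] idfLog(int numDocs, int[] docCount) {
        float[] idf = new float[docCount.length];
        for (int i = 0; i < docCount.length; i++) {
            idf[i] = docCount[i] > 0
                    ? (float) Math.log((double) numDocs / docCount[i])
                    : 0f;
        }
        return idf;
    }

    // Hypothetical sqrt variant: idf_i = sqrt(N / n_i).
    // The real compute_idf may combine sqrt differently; this is a guess for illustration.
    public static float[] idfSqrt(int numDocs, int[] docCount) {
        float[] idf = new float[docCount.length];
        for (int i = 0; i < docCount.length; i++) {
            idf[i] = docCount[i] > 0
                    ? (float) Math.sqrt((double) numDocs / docCount[i])
                    : 0f;
        }
        return idf;
    }

    public static void main(String[] args) {
        int[] docCount = {1, 4, 16};
        System.out.println("log:  " + Arrays.toString(idfLog(16, docCount)));
        System.out.println("sqrt: " + Arrays.toString(idfSqrt(16, docCount)));
    }
}
```

In both variants a term found in every document gets the lowest weight, but the sqrt variant never reaches 0 for such a term, while the log variant does.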

compute_tf_log

public static float[][] compute_tf_log(int[][] TermCount)
Compute term frequency using log method. FIXME: Need a reference for this.
Parameters:
int - [][] TermCount represents each document as a row d; column i of row d holds the number of times query term i appears in document d.
Returns:
float [][] TF is the term counts weighted by 1 + the log of the counts of each term in some document d.
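
As an illustration of the weighting described above, here is a minimal sketch that assumes tf = 1 + ln(count) for positive counts and 0 otherwise. The natural-log base, the handling of zero counts, and the names TfLogSketch and tfLog are assumptions of this sketch; the description does not state them.

```java
import java.util.Arrays;

public class TfLogSketch {
    // termCount[d][i] = number of times query term i appears in document d.
    // Assumption: tf = 1 + ln(count) when count > 0, and 0 when the term is absent.
    public static float[][] tfLog(int[][] termCount) {
        float[][] tf = new float[termCount.length][];
        for (int d = 0; d < termCount.length; d++) {
            tf[d] = new float[termCount[d].length];
            for (int i = 0; i < termCount[d].length; i++) {
                int count = termCount[d][i];
                tf[d][i] = count > 0 ? (float) (1.0 + Math.log(count)) : 0f;
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        int[][] termCount = {{1, 0, 10}, {2, 0, 0}};
        System.out.println(Arrays.deepToString(tfLog(termCount)));
    }
}
```

The log dampens large counts: a term that occurs ten times scores higher than one that occurs twice, but far less than five times higher.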

compute_tf_max

public static float[][] compute_tf_max(int[][] TermCount)
Compute term frequency using the max counts of term per doc. See Baeza-Yates, p. 29.
Parameters:
int - [][] TermCount represents each document as a row d; column i of row d holds the number of times query term i appears in document d.
Returns:
float [][] TF is the term counts normalized by the count of the most frequent term in each document d.
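
A minimal sketch of max-normalized term frequency follows, assuming each count is divided by the largest count in the same row. Taking the maximum over the query-term columns only, returning zeros for a row with no matches, and the names TfMaxSketch and tfMax are assumptions of this sketch.

```java
import java.util.Arrays;

public class TfMaxSketch {
    // termCount[d][i] = number of times query term i appears in document d.
    // Each row is divided by its own largest count, so values fall in [0, 1].
    public static float[][] tfMax(int[][] termCount) {
        float[][] tf = new float[termCount.length][];
        for (int d = 0; d < termCount.length; d++) {
            int[] row = termCount[d];
            tf[d] = new float[row.length];
            int max = 0;
            for (int count : row) {
                max = Math.max(max, count);
            }
            if (max == 0) {
                continue; // assumption: a document with no matching terms keeps all zeros
            }
            for (int i = 0; i < row.length; i++) {
                tf[d][i] = (float) row[i] / max;
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        int[][] termCount = {{2, 4, 0}, {0, 0, 0}};
        System.out.println(Arrays.deepToString(tfMax(termCount)));
    }
}
```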

findRel

public static float[] findRel(java.lang.String queryString,
                              java.util.ArrayList docs)
findRel takes a String "queryString" and an ArrayList "docs".
Parameters:
String - queryString could be a string of terms typed by a user doing a search.
ArrayList - docs is a list of "documents" whose member values consist of a string for each document that might be all or part of a search result.
Returns:
float [] Sim is a vector of similarity values computed using the tf/idf method. See Baeza-Yates for details.
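
To show how a query string and a list of documents could be turned into the TermCount and DocCount inputs used by the other methods, here is a rough sketch. Whitespace tokenization, lower-casing, exact string matching, and every name in TermCountSketch are assumptions of this sketch; the real tokenizer, the optional stemming, and the term-count algorithms selected by setTermCountNaive and setTermCountBM are not documented here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class TermCountSketch {
    // Assumption: terms are separated by whitespace and compared case-insensitively.
    public static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // counts[d][i] = number of times query term i appears in document d.
    public static int[][] termCounts(String[] queryTerms, List<String> docs) {
        int[][] counts = new int[docs.size()][queryTerms.length];
        for (int d = 0; d < docs.size(); d++) {
            for (String word : tokenize(docs.get(d))) {
                for (int i = 0; i < queryTerms.length; i++) {
                    if (queryTerms[i].equals(word)) {
                        counts[d][i]++;
                    }
                }
            }
        }
        return counts;
    }

    // docCount[i] = number of documents that contain query term i at least once.
    public static int[] docCounts(int[][] termCount, int numTerms) {
        int[] docCount = new int[numTerms];
        for (int[] row : termCount) {
            for (int i = 0; i < numTerms; i++) {
                if (row[i] > 0) {
                    docCount[i]++;
                }
            }
        }
        return docCount;
    }

    public static void main(String[] args) {
        String[] query = tokenize("jxta search");
        List<String> docs = Arrays.asList("JXTA search engine for JXTA peers", "peer discovery");
        int[][] termCount = termCounts(query, docs);
        System.out.println(Arrays.deepToString(termCount));
        System.out.println(Arrays.toString(docCounts(termCount, query.length)));
    }
}
```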

makeQueryArray

public static java.lang.String[] makeQueryArray(java.lang.String queryString)

computeTFIDFSimilarity

public static float[] computeTFIDFSimilarity(float[][] TF,
                                             float[] IDF,
                                             int[] WordCount)
Simple TFIDF weighting is used for a similarity measure. FIXME: WordCount is used for the summing part of this, and it appears that it can be removed from the call.
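
The summation is not spelled out above. One plausible reading, sketched here, is that each document's similarity is the sum over query terms of TF[d][i] * IDF[i], with no further normalization. The names SimilaritySketch and similarity are hypothetical, and the sketch leaves out WordCount entirely, which is an assumption in line with the FIXME rather than a description of the actual method.

```java
import java.util.Arrays;

public class SimilaritySketch {
    // tf[d][i]  = weighted frequency of query term i in document d.
    // idf[i]    = inverse document frequency of query term i.
    // Returns sim[d] = sum over i of tf[d][i] * idf[i].
    public static float[] similarity(float[][] tf, float[] idf) {
        float[] sim = new float[tf.length];
        for (int d = 0; d < tf.length; d++) {
            float sum = 0f;
            for (int i = 0; i < idf.length; i++) {
                sum += tf[d][i] * idf[i];
            }
            sim[d] = sum;
        }
        return sim;
    }

    public static void main(String[] args) {
        float[][] tf = {{1f, 0.5f}, {0f, 1f}};
        float[] idf = {2f, 4f};
        System.out.println(Arrays.toString(similarity(tf, idf)));
    }
}
```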

test

public static boolean test(java.io.PrintStream ps)

main

public static void main(java.lang.String[] argv)