net.jxta.search.relevance
Class TFIDFRelevance
java.lang.Object
|
+--net.jxta.search.relevance.TFIDFRelevance
- public class TFIDFRelevance
- extends java.lang.Object
Used to compute relevance of search results for opensearch.
Method Summary
static void clearDebug()
static float[] compute_idf(int num_docs, int[] DocCount)
    Compute inverse document frequency.
static float[][] compute_tf_log(int[][] TermCount)
    Compute term frequency using log method.
static float[][] compute_tf_max(int[][] TermCount)
    Compute term frequency using the max counts of term per doc.
static float[] computeTFIDFSimilarity(float[][] TF, float[] IDF, int[] WordCount)
    Simple TFIDF weighting is used for a similarity measure.
static float[] findRel(java.lang.String queryString, java.util.ArrayList docs)
    findRel takes a String "queryString" and an ArrayList "docs".
static java.lang.String getTermCountAlgorithm()
static void main(java.lang.String[] argv)
static java.lang.String[] makeQueryArray(java.lang.String queryString)
static void setDebug()
static void setStem()
static void setTermCountBM()
static void setTermCountNaive()
static boolean test(java.io.PrintStream ps)
static void unsetStem()
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TFIDFRelevance
public TFIDFRelevance()
TFIDFRelevance
public TFIDFRelevance(java.lang.String count_type)
setDebug
public static void setDebug()
clearDebug
public static void clearDebug()
setStem
public static void setStem()
unsetStem
public static void unsetStem()
setTermCountNaive
public static void setTermCountNaive()
setTermCountBM
public static void setTermCountBM()
getTermCountAlgorithm
public static java.lang.String getTermCountAlgorithm()
compute_idf
public static float[] compute_idf(int num_docs,
int[] DocCount)
- Compute inverse document frequency.
See Baeza-Yates, p. 29, Eq. 2.3, where we are using the weights directly as
the similarity value.
FIXME: This method uses sqrt, but Baeza-Yates uses log. Why?
- Parameters:
num_docs - total number of documents, exclusive of the query; must be >= 0.
DocCount - array in which entry i is the number of times the ith query term
appears in a document. Each entry must be >= 0.
- Returns:
IDF - float[] holding the inverse document frequency of the ith query term
with respect to all the documents, computed using sqrt instead of log.
NOTE: No range checking is done in this method, and the
behavior with invalid parameters is unknown.
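The difference between the log form and a sqrt form can be illustrated with a minimal sketch. It is not the code of this class: the class name IdfSketch and both method names are hypothetical, DocCount[i] is assumed to be the number of documents containing term i (the n_i of Baeza-Yates), and the sqrt formula sqrt(N / n_i) is only a guess at what "computed using sqrt instead of log" might mean. Unlike the documented method, the sketch guards against zero counts.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the implementation of TFIDFRelevance.
final class IdfSketch {

    // Textbook form: idf_i = log(N / n_i), with 0 for terms that occur nowhere.
    static float[] idfLog(int numDocs, int[] docCount) {
        float[] idf = new float[docCount.length];
        for (int i = 0; i < docCount.length; i++) {
            idf[i] = docCount[i] > 0
                    ? (float) Math.log((double) numDocs / docCount[i])
                    : 0f;
        }
        return idf;
    }

    // Assumed sqrt variant: idf_i = sqrt(N / n_i), with 0 for terms that occur nowhere.
    static float[] idfSqrt(int numDocs, int[] docCount) {
        float[] idf = new float[docCount.length];
        for (int i = 0; i < docCount.length; i++) {
            idf[i] = docCount[i] > 0
                    ? (float) Math.sqrt((double) numDocs / docCount[i])
                    : 0f;
        }
        return idf;
    }

    public static void main(String[] args) {
        int[] docCount = {1, 4, 0};
        System.out.println(Arrays.toString(idfLog(4, docCount)));
        System.out.println(Arrays.toString(idfSqrt(4, docCount)));
    }
}
```

Under these assumptions, a term present in every document gets weight 0 with log but weight 1 with sqrt, so the sqrt form never fully discounts a ubiquitous term.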
compute_tf_log
public static float[][] compute_tf_log(int[][] TermCount)
- Compute term frequency using log method.
FIXME: Need a reference for this.
- Parameters:
TermCount - matrix with one row per document d and one column per query
term i; entry [d][i] is the number of times query term i appears in
document d.
- Returns:
TF - float[][] holding the term counts weighted as 1 + the log of the
count of each term in document d.
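A minimal sketch of the log weighting described above, under stated assumptions: the class name TfLogSketch and method name tfLog are hypothetical, the natural logarithm is assumed, and a zero count is mapped to a weight of 0 (the description does not say how zero counts are handled).

```java
import java.util.Arrays;

// Hypothetical illustration only; not the implementation of TFIDFRelevance.
final class TfLogSketch {

    // tf[d][i] = 1 + ln(count[d][i]) when the count is positive, else 0 (assumed).
    static float[][] tfLog(int[][] termCount) {
        float[][] tf = new float[termCount.length][];
        for (int d = 0; d < termCount.length; d++) {
            tf[d] = new float[termCount[d].length];
            for (int i = 0; i < termCount[d].length; i++) {
                int c = termCount[d][i];
                tf[d][i] = c > 0 ? (float) (1.0 + Math.log(c)) : 0f;
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        int[][] counts = {{1, 0, 3}};
        System.out.println(Arrays.deepToString(tfLog(counts)));
    }
}
```

The log dampens repeated occurrences: a term seen three times weighs about twice as much as a term seen once, not three times as much.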
compute_tf_max
public static float[][] compute_tf_max(int[][] TermCount)
- Compute term frequency using the max counts of term per doc.
See Baeza-Yates, p. 29.
- Parameters:
TermCount - matrix with one row per document d and one column per query
term i; entry [d][i] is the number of times query term i appears in
document d.
- Returns:
TF - float[][] holding the term counts weighted by the count of the most
frequent term in document d.
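A minimal sketch of max normalization, assuming "weighted by" means each count is divided by the largest count in its row. The class name TfMaxSketch and method name tfMax are hypothetical. The sketch takes the maximum over the query-term columns only, since that is all TermCount contains; whether the real method does the same is not stated. Rows with no matches are left at 0 to avoid dividing by zero.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the implementation of TFIDFRelevance.
final class TfMaxSketch {

    // tf[d][i] = count[d][i] / max_l count[d][l]; an all-zero row stays all zero.
    static float[][] tfMax(int[][] termCount) {
        float[][] tf = new float[termCount.length][];
        for (int d = 0; d < termCount.length; d++) {
            int max = 0;
            for (int c : termCount[d]) {
                max = Math.max(max, c);
            }
            tf[d] = new float[termCount[d].length];
            if (max == 0) {
                continue; // no query term occurs in this document
            }
            for (int i = 0; i < termCount[d].length; i++) {
                tf[d][i] = (float) termCount[d][i] / max;
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        int[][] counts = {{2, 4, 0}, {0, 0, 0}};
        System.out.println(Arrays.deepToString(tfMax(counts)));
    }
}
```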
findRel
public static float[] findRel(java.lang.String queryString,
java.util.ArrayList docs)
- findRel takes a String "queryString" and an ArrayList "docs".
- Parameters:
queryString - a string of terms, such as one typed by a user doing a search.
docs - a list of "documents", one string per document; each string may be
all or part of a search result.
- Returns:
Sim - float[] of similarity values computed using the tf/idf method. See
Baeza-Yates for details.
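To make the input side concrete, here is a sketch of how a query string and a list of document strings could be turned into a TermCount matrix of the shape the compute_tf methods expect. Everything in it is an assumption: the class name TermCountSketch, the whitespace tokenization, the lower-casing, and the exact-match counting are hypothetical stand-ins, and no stemming is applied.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the implementation of TFIDFRelevance.
final class TermCountSketch {

    // Assumed tokenization: lower-case, then split on runs of whitespace.
    static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // termCount[d][i] = number of exact occurrences of query term i in document d.
    static int[][] countTerms(String queryString, List<String> docs) {
        String[] query = tokenize(queryString);
        int[][] termCount = new int[docs.size()][query.length];
        for (int d = 0; d < docs.size(); d++) {
            String[] words = tokenize(docs.get(d));
            for (int i = 0; i < query.length; i++) {
                for (String w : words) {
                    if (w.equals(query[i])) {
                        termCount[d][i]++;
                    }
                }
            }
        }
        return termCount;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("the red fox and the red hen");
        docs.add("a blue fox");
        int[][] counts = countTerms("red fox", docs);
        for (int[] row : counts) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```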
makeQueryArray
public static java.lang.String[] makeQueryArray(java.lang.String queryString)
computeTFIDFSimilarity
public static float[] computeTFIDFSimilarity(float[][] TF,
float[] IDF,
int[] WordCount)
- Simple TFIDF weighting is used for a similarity measure.
FIXME: WordCount is used for the summing part of this, and it
appears that it can be removed from the call.
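One plausible reading of "simple TFIDF weighting" is that the similarity of document d is the sum over query terms of TF[d][i] * IDF[i]. The sketch below implements that reading only; the class name SimilaritySketch and method name similarity are hypothetical, and the WordCount parameter is left out, which is an assumption rather than a description of the real method.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the implementation of TFIDFRelevance.
final class SimilaritySketch {

    // sim[d] = sum over i of tf[d][i] * idf[i] (assumed form of the weighting).
    static float[] similarity(float[][] tf, float[] idf) {
        float[] sim = new float[tf.length];
        for (int d = 0; d < tf.length; d++) {
            float sum = 0f;
            for (int i = 0; i < idf.length; i++) {
                sum += tf[d][i] * idf[i];
            }
            sim[d] = sum;
        }
        return sim;
    }

    public static void main(String[] args) {
        float[][] tf = {{1.0f, 0.5f}, {0.0f, 2.0f}};
        float[] idf = {2.0f, 1.0f};
        System.out.println(Arrays.toString(similarity(tf, idf)));
    }
}
```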
test
public static boolean test(java.io.PrintStream ps)
main
public static void main(java.lang.String[] argv)