SVM-div

Support Vector Machine for Predicting Diverse Subsets

Author:
Yisong Yue <yisongyue@gmail.com>

Version: 1.02
Date: 1/07/2011

Overview
SVM^div is a Support Vector Machine (SVM) algorithm for predicting diverse subsets (of documents). It a supervised learning approach to selecting for diversity (in information retrieval). Rather than predicting rankings (as is commonly done in information retrieval), SVM^div learns to predict (i.e. retrieve) a subset of the candidate documents. Given training data with labeled subtopics, SVM^div learns model parameters with the goal of minimizing subtopic loss in the predicted subsets. It is an implementation of the SVM^div method described in [1] (with a specific joint feature map formulation).

Predicting subsets can be thought of as type of structured prediction. SVM^div is implemented using SVM^python, which exposes a Python interface to SVM^struct. SVM^struct is a general SVM framework for learning structured prediction tasks and was developed by Thorsten Joachims. For more algorithmic details, refer to [1] for SVM^div and [2] for SVM^struct.

Source Code & Data Files
You can download the source code of SVM^div from the following location:

http://projects.yisongyue.com/svmdiv/downloads/svm-div-v1.02.tar.gz

The package above also contains a dataset based on the TREC 6-8 Interactive Track.

Compiling
Currently, SVM-div only works in a Linux environment. Windows users can run SVM-div within Cygwin. To compile, simply run 'make' in the svm-div directory.

**NOTE** - SVM-div does require Python version 2.4 or newer in order to run properly. Within the Makefile is the following line:

PYTHON = python This line indicates the location of the Python program to use. Currently, it is set to the default Python program. If your default Python program is older than version 2.4, then you will need to change this line to the location of a newer version of Python. For example:

PYTHON = /opt/bin/python2.4 You can download the latest version of Python at http://www.python.org/download/

Input Data Format
The input data file which SVM-div reads is an index file with the path+filename of all data files. Each data file should contain all the documents (aka examples) for a single query (aka set of examples).

Within a data file, each line contains the information for a single document. SVM-div assumes all documents are represented using word frequency counts oebying the following format:

[label] [word_id]:[doc_word_freq] [word_id]:[doc_word_freq] ... Labels are represented as binary strings (e.g., '10010'), where each digit corresponds to a subtopic for that query, and 1/0 indicates relevance/non-relevance. All documents should be relevant to at least one subtopic. Subtopics are allowed to change from query to query.

In addition to document word frequencies, SVM-div also uses title words. A word is denoted as being in the title be prepending the word entry with 'T'.

Features are represented sparsely. For each document, only non-zero word frequencies need to be stored in the data file. For example, a document could be represented as:

01100 T1:0.25 2:0.25 5:0.5 Here, this document is relevant to subtopics 2 & 3 for this particular query. Word 1 has frequency 0.25, word 2 has frequency 0.25, and word 3 has frequency 0.5. Notice the 'T' designation in fron the entry for word 1. This indicates that word 1 is also present in the title of the document. SVM-div ignores the frequency of words in the title, and only considers whether a word is present in the title.

All word IDs should be invariant for documents in a single query. Word IDs can change from query to query.

Refer to train_index_file and test_index_file for example index files. The data files are in the folder TREC_Interactive_Subtopic, and are the data files used for the experiments in [1].

Model Config File & Feature Implementation
The file 'config.txt' contains configuration settings for SVM-div.

The first entry RETREIVAL_SIZE indicates the size of the retrieved subsets. The default value is 5.

The second entry TF_IMPORTANCE_THRESHOLDS defines the different importance levels based on term-frequency (document word frequency) values. These thresholds correspond to importance levels defined in section 4.1 of [1]. MAKE SURE THE VALUES ARE IN INCREASING ORDER. A threshold value of 0 is automatically included. For example, a threshold value of 0.1 defines an importance level which includes all words which appear in at least 10% of some document. Note that title words are always defined as a separate importance level, independent of TF values.

The third entry BENEFIT_THRESHOLDS defines the feature thresholds for words within each importance level. These thresholds correspond to features defined in section 4.2 of [1]. MAKE SURE THE VALUES ARE IN INCREASING ORDER. A threshold value of 0 is automatically included. For each importance level, each benefit threshold defines 3 features. For importance level A and benefit threshold B, the three binary features are:

a word in importance level 0 (appears at least once in a document) appears in at least B fraction of documents at importance level A. For example, the word 'cat' appears in at leat 10% of the titles of documents in the candidate set. This feature is turned on whenever any document in the predicted subset contains at least 1 copy of the word 'cat' (i.e., importance level 0), and the word 'cat' appears in at least B fraction of the titles of the candidate documents.
a word in importance level A appears in at least B fraction of documents at importance level 0. For example, this feature is turned on when any document in the predicted subset contains 'cat' in the title and 'cat' appears in at least B fraction of the candidate documents.
a word in importance level A appears in at least B fraction of documents at importance level A. For example, this feature is turned on when any document in the predicted subset contains 'cat' in the title and 'cat' appears in at least B fraction of the titles of the candidate documents.

Let TF be the total number of TF_IMPORTANCE_TRESHOLDS values, and BT be the total number of BENEFIT_THRESHOLDS values. Then the total number of features used by SVM-div is (BF+1) + 3*TF*(BF+1) + 3*(BF+1), where the three terms correspond to features for the base level 0, each TF treshold level, and the title level, respectively. These features are defined for every word, and the final feature vector is a vector sum over all words as described in equations 3 & 4 in [1]. This is a specific feature implementation of equation 4 in [1], and is not necessarily appropriate for all retrieval tasks. Other information which may be useful include TFIDF thresholds, scores from existing retrieval functions, appearance of words as anchor text, and appearance of words in the first paragraph.

Loss Function
The loss function used during training and evaluation is weighted subtopic loss. When evaluating a predicted subset, the penalty for not covering a particular subtopic is proportional to the total number of documents in the candidate set which covers that subtopic. This loss function emphasizes the popular subtopics and mitigates the effects of labeling noise in the tail subtopics. The final computed loss value is then normalized to lie between 0 and 1.

Learning
After the program is compiled, the executable to use for learning is svm_div_learn. Use the following usage pattern:

svm_div_learn -c [c_value] [example_index_file] [model_file] where c_value is the C parameter which controls the tradeoff between regularization and training loss, example_index_file is the index file to the data files, and model_file is the file to write the trained model to. For example:

svm_div_learn -c 0.1 train_index_file temp_model will read in the example index file and train using a C parameter of 0.1. The trained model will be stored in the file temp_model.

During training, the program will sometimes report warnings that the slack of the most violated constraint is smaller than the slack of the working set. This is due to SVM^div using approximate inference during constraint generation and is expected behavior.

**NOTE** - SVM-div requires that the current shell has the PYTHONPATH environment variable set to contain '.' in the path. This allows the Python interpreter to check in the local directory to look for Python modules (in this case, svmstruct_div.py). You can either set this in your shell's profile or resource file, or set it in a shell script. For example, in bash, you can run the command

export PYTHONPATH=${PYTHONPATH}:. to add the path '.' to the PYTHONPATH environment variable to your current shell.

Classifying
After the program is compiled and a model is learned, the executable to use for classifying is svm_div_classify. Use the following usage pattern:

svm_div_classify [example_index_file] [model_file] [output_file] where example_index-file is the index file to the data files, model_file is the model to use for classification, and output_file is the file to write the classification output to. For example:

svm_div_classify test_index_file temp_model prediction.out In general, the ranking of documents is computed from sorting by the classification scores in descending order. SVM-div does not sort the documents, but only computes and outputs the classification scores in the order the documents were read in.

**NOTE** - SVM-div requires that the current shell has the PYTHONPATH environment variable set to contain '.' in the path. This allows the Python interpreter to check in the local directory to look for Python modules (in this case, svmstruct_div.py). You can either set this in your shell's profile or resource file, or set it in a shell script. For example, in bash, you can run the command

export PYTHONPATH=${PYTHONPATH}:. to add the path '.' to the PYTHONPATH environment variable.

References

[1] Y. Yue and T. Joachims. Predicting Diverse Subsets Using Structural SVMs, In Proceedings of ICML, 2008 [pdf][ppt]

[2] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large Margin Methods for Structured and Interdependent Output Variables, Journal of Machine Learning Research (JMLR), 2005 [pdf]