SVMmap

Support Vector Machine for Optimizing Mean Average Precision

Authors:
Yisong Yue <yisongyue@gmail.com>
Thomas Finley <tfinley@gmail.com>

Version: 1.02
Date: 10/31/2011

Overview
SVMmap is a Support Vector Machine (SVM) algorithm for predicting rankings (of documents). It performs supervised learning using binary labeled training examples, with the goal of optimizing Mean Average Precision (MAP). The original motivation was to learn to rank documents (where the binary labels are relevant and non-relevant).

Predicting rankings can be thought of as a type of structured prediction, which is also the prediction task of SVMperf. Unlike SVMperf, the goal of SVMmap is to optimize for MAP, which is an important benchmark in the Information Retrieval community. SVMmap is implemented using SVMpython, which exposes a Python interface to SVMstruct. SVMstruct is a general SVM framework for learning structured prediction tasks and was developed by Thorsten Joachims. For more algorithmic details, refer to [1] for SVMmap, [2] for SVMperf, and [3] for SVMstruct.

Source Code
The source code of SVMmap is available for download and includes sample data files. A modified TREC dataset (used for the results in Tables 9 & 10 of [1]) is also available for download.

Compiling
Currently, SVM-map only works in a Linux environment. Windows users can run SVM-map within Cygwin. To compile, simply run 'make' in the svm-map directory.

**NOTE** - SVM-map does require Python version 2.4 or newer in order to run properly. Within the Makefile is the following line:
    PYTHON = python
This line indicates the location of the Python program to use. Currently, it is set to the default Python program. If your default Python program is older than version 2.4, then you will need to change this line to the location of a newer version of Python. For example:
    PYTHON = /opt/bin/python2.4
You can download the latest version of Python at http://www.python.org/download/

Input Data Format
The input data file which SVM-map reads is an index file with the path+filename of all data files. Each data file should contain all the documents (aka examples) for a single query (aka set of examples). MAP is computed as the mean of the average precision scores for each query.
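
As a rough illustration of the last sentence, the sketch below computes average precision for one query and then averages over queries. It is a standalone Python 3 helper and not part of SVM-map; the function names are hypothetical, labels greater than 0 are treated as relevant, and returning 0.0 for a query with no relevant documents is an assumed convention.

```python
def average_precision(labels_in_ranked_order):
    """Average precision for one query.

    labels_in_ranked_order: document labels, best-ranked document first.
    Labels > 0 count as relevant, labels <= 0 as non-relevant.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(labels_in_ranked_order, start=1):
        if label > 0:
            hits += 1
            # Precision at this rank, counted only at relevant documents.
            precision_sum += hits / rank
    # Assumed convention: a query with no relevant documents scores 0.0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query average precision scores."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two queries: relevant documents at ranks 1 and 3, and at rank 2.
example_map = mean_average_precision([[1, -1, 1], [-1, 1]])
```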

Within a data file, each line contains the information for a single document. SVM-map assumes all documents are represented using a high-dimensional feature space of the following format:
    [label] qid:[qid] [feature_id]:[feature_value] [feature_id]:[feature_value] ...
Labels that are less than or equal to 0 are interpreted as non-relevant, and labels greater than 0 are interpreted as relevant. The id of this particular query is stored in qid. All qid values should be the same for all documents in a data file, and qid values should be different for documents in different data files.

Features are represented sparsely. For each document, only the non-zero feature values need to be stored in the data file. For example, a document could be represented as:
    -1 qid:1 3:5 20:1 206:2
In this case, this document is not relevant to query 1, and it has 3 non-zero feature values. Feature 3 has value 5, feature 20 has value 1, and feature 206 has value 2.
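
The line format can be read with a few lines of Python. The following is a minimal sketch for Python 3, separate from SVM-map's own parser; the function name parse_document_line is hypothetical, and it handles only the fields described above (label, qid, and sparse feature pairs).

```python
def parse_document_line(line):
    """Parse '[label] qid:[qid] [feature_id]:[feature_value] ...'.

    Returns (label, qid, features), where features maps each
    feature id to its (non-zero) value.
    """
    tokens = line.split()
    label = float(tokens[0])
    if not tokens[1].startswith("qid:"):
        raise ValueError("expected 'qid:[qid]' as the second field")
    qid = tokens[1][len("qid:"):]
    features = {}
    for token in tokens[2:]:
        feature_id, value = token.split(":", 1)
        features[int(feature_id)] = float(value)
    return label, qid, features


label, qid, features = parse_document_line("-1 qid:1 3:5 20:1 206:2")
is_relevant = label > 0  # labels <= 0 mean non-relevant
```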

Refer to sample.data and data_index_file for example data and index files. In the case of ranking documents by query, one example of a feature would be the cosine similarity between the query and the document.
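
If your documents are already grouped by query, a small script can write one data file per query plus the index file. This is a sketch for Python 3 under two assumptions that should be checked against the shipped data_index_file: that the index file lists one path per line, and that any file naming is acceptable (the query_[qid].data names here are made up for illustration).

```python
import os


def write_query_files(lines_by_qid, out_dir, index_path):
    """Write one data file per query and an index file listing them.

    lines_by_qid: dict mapping qid -> list of already formatted
    document lines for that query.
    """
    os.makedirs(out_dir, exist_ok=True)
    data_paths = []
    for qid in sorted(lines_by_qid):
        # Hypothetical naming scheme: one file per query id.
        data_path = os.path.join(out_dir, "query_%s.data" % qid)
        with open(data_path, "w") as data_file:
            for line in lines_by_qid[qid]:
                data_file.write(line.rstrip("\n") + "\n")
        data_paths.append(data_path)
    # Assumed index layout: one data file path per line.
    with open(index_path, "w") as index_file:
        for data_path in data_paths:
            index_file.write(data_path + "\n")
    return data_paths
```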

Learning
After the program is compiled, the executable to use for learning is svm_map_learn. Use the following usage pattern:
    svm_map_learn -c [c_value] [example_index_file] [model_file]
where c_value is the C parameter which controls the tradeoff between regularization and training loss, example_index_file is the index file to the data files, and model_file is the file to write the trained model to. For example:
    svm_map_learn -c 0.1 data_index_file temp_model
will read in the example index file and train using a C parameter of 0.1. The trained model will be stored in the file temp_model.
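
Since a good C value is usually found by trying several, it can help to generate the training commands from a script. The sketch below (Python 3) only builds argument lists following the usage pattern above; the function name, the model file names, and the particular C values are illustrative assumptions.

```python
def learn_command(c_value, example_index_file, model_file,
                  executable="svm_map_learn"):
    """Build the argument list for one svm_map_learn run."""
    return [executable, "-c", str(c_value), example_index_file, model_file]


# One command per candidate C value; each writes its own model file.
commands = [
    learn_command(c, "data_index_file", "model_c%s" % c)
    for c in (0.01, 0.1, 1.0)
]

# Each list can then be run with, for example, subprocess.call(command)
# from the directory that contains the compiled executable.
```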

**NOTE (PYTHONPATH)** - SVM-map requires that the current shell has the PYTHONPATH environment variable set to contain '.' in the path. This allows the Python interpreter to check in the local directory to look for Python modules (in this case, svmstruct_map.py). You can either set this in your shell's profile or resource file, or set it in a shell script. For example, in bash, you can run the command
    export PYTHONPATH=${PYTHONPATH}:.
to add the path '.' to the PYTHONPATH environment variable in your current shell.

**NOTE (Printing Learned Weights)** - If you want to print out the learned weight vectors after training, uncomment the line
    print sm.w[1:]
in the function write_struct_model in svmstruct_map.py (Line 484).

Classifying
After the program is compiled and a model is learned, the executable to use for classifying is svm_map_classify. Use the following usage pattern:
    svm_map_classify [example_index_file] [model_file] [output_file]
where example_index_file is the index file to the data files, model_file is the model to use for classification, and output_file is the file to write the classification output to.

In general, the ranking of documents is obtained by sorting the classification scores in descending order. SVM-map does not sort the documents; it only computes and outputs the classification scores in the order the documents were read in.
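
Turning the scores into a ranking is then a single sort. The following Python 3 sketch takes the scores of one query as a list, in the order the documents were read; how you load them from output_file depends on its layout, which is not described here. The function name rank_by_score is hypothetical.

```python
def rank_by_score(scores):
    """Return document positions (0-based, in read order),
    sorted by descending classification score.

    Python's sort is stable, so documents with equal scores
    keep their original relative order.
    """
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


scores = [0.3, 1.7, -0.4, 0.9]
ranking = rank_by_score(scores)  # [1, 3, 0, 2]
```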

**NOTE (PYTHONPATH)** - SVM-map requires that the current shell has the PYTHONPATH environment variable set to contain '.' in the path. This allows the Python interpreter to check in the local directory to look for Python modules (in this case, svmstruct_map.py). You can either set this in your shell's profile or resource file, or set it in a shell script. For example, in bash, you can run the command
    export PYTHONPATH=${PYTHONPATH}:.
to add the path '.' to the PYTHONPATH environment variable.

References
  • [1] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A Support Vector Method for Optimizing Average Precision. In Proceedings of SIGIR, 2007.
  • [2] T. Joachims. A Support Vector Method for Multivariate Performance Measures. In Proceedings of ICML, 2005.
  • [3] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large Margin Methods for Structured and Interdependent Output Variables. Journal of Machine Learning Research (JMLR), 2005.


  • [All Content © 2024 Yisong Yue]