#!/usr/bin/python
# The contents of this file are in the public domain. See LICENSE_FOR_EXAMPLE_PROGRAMS.txt
#
# 
# This is an example illustrating the use of the SVM-Rank tool from the dlib C++
# Library.  This is a tool useful for learning to rank objects.  For example,
# you might use it to learn to rank web pages in response to a user's query,
# the idea being to rank the most relevant pages higher than non-relevant pages.
# 
# In this example, we will create a simple test dataset and show how to learn a
# ranking function from it.  The purpose of the function will be to give
# "relevant" objects higher scores than "non-relevant" objects.  The idea is
# that you use this score to order the objects so that the most relevant objects
# come to the top of the ranked list.
#
# COMPILING/INSTALLING THE DLIB PYTHON INTERFACE
#   You can install dlib using the command:
#       pip install dlib
#
#   Alternatively, if you want to compile dlib yourself then go into the dlib
#   root folder and run:
#       python setup.py install
#   or
#       python setup.py install --yes USE_AVX_INSTRUCTIONS
#   if you have a CPU that supports AVX instructions, since this makes some
#   things run faster.  
#
#   Compiling dlib should work on any operating system so long as you have
#   CMake installed.  On Ubuntu, this can be done easily by running the
#   command:
#       sudo apt-get install cmake
#

import dlib


# Now let's make some testing data.  To make it really simple, let's suppose
# that we are ranking 2D vectors and that vectors with positive values in the
# first dimension should rank higher than other vectors.  So what we do is make
# examples of relevant (i.e. high ranking) and non-relevant (i.e. low ranking)
# vectors and store them into a ranking_pair object like so:
data = dlib.ranking_pair()
# Here we add two examples.  In real applications, you would want lots of
# examples of relevant and non-relevant vectors.
data.relevant.append(dlib.vector([1, 0]))
data.nonrelevant.append(dlib.vector([0, 1]))

# Now that we have some data, we can use a machine learning method to learn a
# function that will give high scores to the relevant vectors and low scores to
# the non-relevant vectors.
trainer = dlib.svm_rank_trainer()
# Note that the trainer object has some parameters that control how it behaves.
# For example, since this is the SVM-Rank algorithm it has a C parameter that
# controls the trade-off between trying to fit the training data exactly and
# selecting a "simpler" solution which might generalize better.
trainer.c = 10

# So let's do the training.
rank = trainer.train(data)

# Now if you call rank on a vector it will output a ranking score.  In
# particular, the ranking score for relevant vectors should be larger than the
# score for non-relevant vectors.  
print("Ranking score for a relevant vector:     {}".format(
    rank(data.relevant[0])))
print("Ranking score for a non-relevant vector: {}".format(
    rank(data.nonrelevant[0])))
# The output is the following:
#    Ranking score for a relevant vector:     0.5
#    Ranking score for a non-relevant vector: -0.5


# If we want an overall measure of ranking accuracy we can compute the ordering
# accuracy and mean average precision values by calling test_ranking_function().
# The ordering accuracy tells us how often a relevant vector was ranked ahead of
# a non-relevant vector.  Here it returns 1 for both metrics, indicating that
# the rank function outputs a perfect ranking.
print(dlib.test_ranking_function(rank, data))
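
# The returned object also exposes the two numbers individually.  A minimal
# sketch, assuming the result exposes ranking_accuracy and mean_ap attributes
# (as in dlib's Python API):
result = dlib.test_ranking_function(rank, data)
print("Ordering accuracy:      {}".format(result.ranking_accuracy))
print("Mean average precision: {}".format(result.mean_ap))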

# The ranking scores are computed by taking the dot product between a learned
# weight vector and a data vector.  If you want to see the learned weight vector
# you can display it like so:
print("Weights: {}".format(rank.weights))
# In this case the weights are:
#  0.5 
# -0.5 
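
# As a quick sanity check, we can reproduce the relevant vector's score by
# taking the dot product ourselves.  This is just a sketch; it assumes
# dlib.vector supports len() and indexing from Python:
w = rank.weights
x = data.relevant[0]
score = sum(w[i] * x[i] for i in range(len(w)))
print("Manually computed score: {}".format(score))  # should match 0.5 above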

# In the above example, our data contains just two sets of objects: the
# relevant set and the non-relevant set.  The trainer is attempting to find a
# ranking function that gives every relevant vector a higher score than every
# non-relevant vector.  Sometimes what you want to do is a little more complex
# than this.
#
# For example, in the web page ranking example we have to rank pages based on a
# user's query.  In this case, each query will have its own set of relevant and
# non-relevant documents.  What might be relevant to one query may well be
# non-relevant to another.  So in this case we don't have a single global set of
# relevant web pages and another set of non-relevant web pages.  
#
# To handle cases like this, we can simply give multiple ranking_pair instances
# to the trainer.  Each ranking_pair then represents the relevant/non-relevant
# sets for a particular query.  An example is shown below (for simplicity, we
# reuse our data from above to make 4 identical "queries").
queries = dlib.ranking_pairs()
queries.append(data)
queries.append(data)
queries.append(data)
queries.append(data)

# We can train just as before.  
rank = trainer.train(queries)

# Now that we have multiple ranking_pair instances, we can also use
# cross_validate_ranking_trainer().  This performs cross-validation by splitting
# the queries up into folds.  That is, it lets the trainer train on a subset of
# ranking_pair instances and tests on the rest.  It does this over 4 different
# splits and returns the overall ranking accuracy based on the held out data.
# Just like test_ranking_function(), it reports both the ordering accuracy and
# mean average precision.
print("Cross validation results: {}".format(
    dlib.cross_validate_ranking_trainer(trainer, queries, 4)))
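
# Since cross-validation estimates accuracy on held-out queries, one common
# use is picking the C parameter mentioned earlier.  A minimal sketch (the
# candidate C values are arbitrary, and the ranking_accuracy attribute name is
# the same assumption as above):
best_c, best_accuracy = None, -1
for c in [0.1, 1, 10, 100]:
    trainer.c = c
    result = dlib.cross_validate_ranking_trainer(trainer, queries, 4)
    if result.ranking_accuracy > best_accuracy:
        best_c, best_accuracy = c, result.ranking_accuracy
print("Best C: {}, cross-validated ordering accuracy: {}".format(
    best_c, best_accuracy))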

# Finally, note that the ranking tools also support the use of sparse vectors in
# addition to dense vectors (which we used above).  So if we wanted to do
# exactly what we did in the first part of the example program above but using
# sparse vectors we would do it like so:

data = dlib.sparse_ranking_pair()
samp = dlib.sparse_vector()

# Make samp represent the same vector as dlib.vector([1, 0]).  In dlib, a sparse
# vector is just an array of pair objects.  Each pair stores an index and a
# value.  Moreover, the svm-ranking tools require sparse vectors to be sorted
# and to have unique indices.  This means that the indices are listed in
# increasing order and no index value shows up more than once.  If necessary,
# you can use the dlib.make_sparse_vector() routine to put a sparse vector into
# this properly sorted, unique-index form (see the short sketch after we build
# the data below).
samp.append(dlib.pair(0, 1))
data.relevant.append(samp)

# Now make samp represent the same vector as dlib.vector([0, 1])
samp.clear()
samp.append(dlib.pair(1, 1))
data.nonrelevant.append(samp)
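
# As promised above, a quick sketch of dlib.make_sparse_vector(): it sorts a
# sparse vector's pairs by index in place (merging any duplicate indices) so
# that the vector meets the sorted, unique-index requirement.
sv = dlib.sparse_vector()
sv.append(dlib.pair(3, 1))
sv.append(dlib.pair(0, 2))
dlib.make_sparse_vector(sv)  # sv is now ordered: index 0 first, then index 3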

trainer = dlib.svm_rank_trainer_sparse()
rank = trainer.train(data)
print("Ranking score for a relevant vector:     {}".format(
    rank(data.relevant[0])))
print("Ranking score for a non-relevant vector: {}".format(
    rank(data.nonrelevant[0])))
# Just as before, the output is the following:
#    Ranking score for a relevant vector:     0.5
#    Ranking score for a non-relevant vector: -0.5