Monday, May 13, 2013

Evaluating models note

How to evaluate a language model?

Whether a model is good or bad really depends on its performance on test data. Given test data, we can write this symbolically as:

p(model | test-data),

Using Bayes' rule:

p(model | test-data) = p(model) * p(test-data | model) / p(test-data) (1)

To simplify, we assume that p(model) is the same for all models. And since the test data is fixed, p(test-data) is also the same for all models. As a consequence, the formula simplifies to:

p(model | test-data) ~ p(test-data | model) (2)
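
As a quick illustration, here is a minimal Python sketch of how (2) ranks two hypothetical models on the same test data. The probabilities are made up purely for illustration; they are not from any real model:

import math

# Two hypothetical models assign a likelihood to the same test data;
# by (2) we prefer the model with the higher p(test-data | model).
# Working in log space avoids the underflow issue discussed below.
log_p_a = math.log(1e-50)   # assumed log p(test-data | model A)
log_p_b = math.log(1e-60)   # assumed log p(test-data | model B)

better = "A" if log_p_a > log_p_b else "B"
print("Preferred model under (2):", better)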

Take a bigram model as an example. Given test data, formula (2) breaks down into a product of many factors, each less than one, so if the test data is long, the value computed by formula (2) becomes extremely small. In practice, people therefore report perplexity instead of (2); its logarithm, the per-word negative log-probability, is:

-log( p(test-data | model) ) / #(terms) (3)

where #(terms) is the number of words in the test data, and the perplexity itself is the exponential of (3).

As (2) increases, (3) decreases, which means the best model is the one with the lowest perplexity.
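
To make (3) concrete, here is a minimal Python sketch of a bigram model evaluated on a tiny test sentence. The toy corpora, the whitespace tokenization, and the add-one smoothing are assumptions chosen only for illustration:

import math
from collections import Counter

# Toy training and test data (assumed for this sketch).
train_tokens = "<s> the cat sat on the mat </s>".split()
test_tokens  = "<s> the cat sat on the rug </s>".split()

unigram_counts = Counter(train_tokens)
bigram_counts = Counter(zip(train_tokens, train_tokens[1:]))
vocab_size = len(unigram_counts)

def bigram_prob(prev, word):
    # Add-one (Laplace) smoothed p(word | prev); using the unigram count
    # of prev as the context count is a simplification for this sketch.
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

# log p(test-data | model): sum of log bigram probabilities over the test data.
log_prob = sum(math.log(bigram_prob(prev, word))
               for prev, word in zip(test_tokens, test_tokens[1:]))

n_terms = len(test_tokens) - 1          # #(terms): number of predicted words
neg_log_per_word = -log_prob / n_terms  # formula (3)
perplexity = math.exp(neg_log_per_word) # perplexity = exponential of (3)

print("per-word negative log-probability:", round(neg_log_per_word, 3))
print("perplexity:", round(perplexity, 3))

Note that the sketch sums log-probabilities rather than multiplying the raw probabilities, which avoids the underflow problem mentioned above when the test data is long.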
