“Most Likely” is an All or Nothing Proposition
The principle of maximum likelihood estimation is generally not explained well; readers are made to believe that it should be obvious to them that choosing the “most likely outcome” is the most sensible thing to do. It isn’t obvious, and it need not be the most sensible thing to do.
First, recall the statement I made in an earlier paper:
The author believes firmly that asking for an estimate of a parameter is, a priori, a meaningless question. It has been given meaning by force of habit. An estimate only becomes useful once it is used to make a decision, serving as a proxy for the unknown true parameter value. Decisions include: the action taken by a pilot in response to estimates from the flight computer; an automated control action in response to feedback; and, what someone decides they hear over a mobile phone (with the pertinent question being whether the estimate produced by the phone of the transmitted message is intelligible). Without knowing the decision to be made, whether an estimator is good or bad is unanswerable. One could hope for an estimator that works well for a large class of decisions, and the author sees this as the context of estimation theory.
Consider the following problem. Assume two coins are tossed, but somehow the outcome of the first coin influences the outcome of the second coin. Specifically, the possible outcomes (H = heads, T = tails) and their probabilities are: HH ; HT ; TH ; TT . Given these probabilities, what is our best guess as to the outcome? We have been conditioned to respond by saying that the most likely outcome is the one with the highest probability, namely, HH. What is our best guess as to the outcome of the first coin only? Well, there is chance it will be H and chance it will be T, so the most likely outcome is T. How can it be that the most likely outcome of the first coin is T but the most likely outcome of both coins is HH?
The (only) way to understand this sensibly is to think in terms of how the estimate will be used. What “most likely” really means is that it is the best strategy to use when placing an all-or-nothing bet. If I must bet on the outcome of the two coins, and I win $1 if I guess correctly and win nothing otherwise, my best strategy is to bet on HH. If I must bet on the outcome of the first coin, the best strategy is to bet on T. This is not a contradiction because betting on the first coin being T is the same as betting on the two coins being either TH or TT. I can now win in two cases, not just one; it is a different gamble.
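The two bets can be compared numerically with a minimal sketch. The probabilities in `joint` are illustrative assumptions, chosen only so that HH is the most probable pair while T is the more probable first coin; the function names are likewise hypothetical.

```python
# Hypothetical joint probabilities for (first coin, second coin).
# These values are assumptions for illustration only.
joint = {"HH": 0.35, "HT": 0.05, "TH": 0.30, "TT": 0.30}


def best_joint_bet(joint):
    """Best all-or-nothing bet on the outcome of both coins."""
    return max(joint, key=joint.get)


def first_coin_marginal(joint):
    """Probability of each face of the first coin, summing over the second."""
    marginal = {"H": 0.0, "T": 0.0}
    for outcome, p in joint.items():
        marginal[outcome[0]] += p
    return marginal


def best_first_coin_bet(joint):
    """Best all-or-nothing bet on the first coin alone."""
    marginal = first_coin_marginal(joint)
    return max(marginal, key=marginal.get)


print(best_joint_bet(joint))       # bet on both coins
print(first_coin_marginal(joint))  # a bet on T wins on TH and on TT
print(best_first_coin_bet(joint))  # bet on the first coin only
```

With these assumed numbers, the bet on the first coin being T collects the probability of both TH and TT, which is why it can beat H even though HH is the single most probable pair.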
The above is not an idle example. In communications, the receiver must estimate what symbols were sent. A typical mathematical formulation of the problem is estimating the state of a hidden Markov chain. One can choose to estimate the most likely sequence of states or the most likely state at a particular instant. The above example explains the difference and helps determine which is the more appropriate estimate to use.
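The same distinction can be sketched for a tiny hidden Markov chain by brute-force enumeration. Everything here is an assumption made for illustration: the two states, the initial and transition probabilities, and the deliberately uninformative emissions, which are chosen so that the chain behaves like a pair of dependent coins. Enumeration is only feasible for very short chains; in practice the most likely sequence is usually found with the Viterbi algorithm and the per-instant estimates with the forward-backward algorithm.

```python
from itertools import product


def sequence_posterior(init, trans, emit, observations):
    """Posterior over all hidden state sequences, by enumeration (tiny models only)."""
    states = list(init)
    weights = {}
    for seq in product(states, repeat=len(observations)):
        # Probability of this state sequence together with the observations.
        w = init[seq[0]] * emit[seq[0]][observations[0]]
        for prev, cur, obs in zip(seq, seq[1:], observations[1:]):
            w *= trans[prev][cur] * emit[cur][obs]
        weights[seq] = w
    total = sum(weights.values())
    return {seq: w / total for seq, w in weights.items()}


def most_likely_sequence(posterior):
    """Best all-or-nothing bet on the whole sequence."""
    return max(posterior, key=posterior.get)


def most_likely_states(posterior):
    """Best all-or-nothing bet on each instant separately."""
    length = len(next(iter(posterior)))
    estimate = []
    for t in range(length):
        marginal = {}
        for seq, p in posterior.items():
            marginal[seq[t]] = marginal.get(seq[t], 0.0) + p
        estimate.append(max(marginal, key=marginal.get))
    return tuple(estimate)


# Hypothetical two-state chain; all numbers are assumptions.
init = {"H": 0.4, "T": 0.6}
trans = {"H": {"H": 0.875, "T": 0.125}, "T": {"H": 0.5, "T": 0.5}}
# Uninformative emissions: the observations carry no information about the state.
emit = {"H": {"x": 0.5, "y": 0.5}, "T": {"x": 0.5, "y": 0.5}}

posterior = sequence_posterior(init, trans, emit, ["x", "y"])
print(most_likely_sequence(posterior))
print(most_likely_states(posterior))
```

For this assumed chain the two estimates disagree: the sequence estimate and the instant-by-instant estimate answer different bets, so neither is wrong.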
Finally, an all-or-nothing bet is not necessarily the most appropriate way of measuring the performance of an estimator. For instance, partial credit might be given for being close to the answer: if I guess both coins correctly I win $2, if I guess one coin correctly I win $1, and otherwise I win nothing. This can be interpreted as “regularising” the maximum likelihood estimate. Nevertheless, at the end of the day, the only way to understand an estimator is in the broader context of the types of decisions that can be made well by using that estimator.
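How the choice of reward changes the best guess can be sketched by computing expected winnings directly. The joint probabilities are again illustrative assumptions, and the helper names are hypothetical.

```python
# Hypothetical joint probabilities for (first coin, second coin).
# These values are assumptions for illustration only.
joint = {"HH": 0.35, "HT": 0.05, "TH": 0.30, "TT": 0.30}


def all_or_nothing(guess, outcome):
    """Win $1 only if both coins are guessed correctly."""
    return 1.0 if guess == outcome else 0.0


def partial_credit(guess, outcome):
    """Win $1 for each coin guessed correctly."""
    return float(sum(g == o for g, o in zip(guess, outcome)))


def expected_reward(guess, joint, reward):
    """Average winnings of a guess under the joint distribution."""
    return sum(p * reward(guess, outcome) for outcome, p in joint.items())


def best_bet(joint, reward):
    """Guess with the highest expected winnings under the given reward."""
    return max(joint, key=lambda guess: expected_reward(guess, joint, reward))


for reward in (all_or_nothing, partial_credit):
    guess = best_bet(joint, reward)
    print(reward.__name__, guess, expected_reward(guess, joint, reward))
```

Under partial credit the expected winnings are the sum of the per-coin probabilities of being right, so the best guess is the most likely outcome of each coin taken separately, which need not coincide with the most likely pair.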