Note: this is just the “written” parts of my report. The notebook GottesmanFinalProject.ipynb
has a lot of code that I spent many, many hours writing as well as well-labeled charts that correspond to the “results” part of this write-up. That notebook contains a more complete and more extensive version of my project, but I am including this write-up as well as a more succinct summary of my project.
Sports are perhaps the world's most popular form of entertainment. Over 40 percent of American children play a sport, and the 24 most-watched television broadcasts of all time are sporting events. Given their popularity and importance to society, it seems logical that we should try our best to make sure sports are as fair as possible, especially professional sports, where players' livelihoods depend on their performances. In the past year, a perceived lack of fairness in baseball due to a cheating scandal involving the Houston Astros caused an uproar and resulted in significant consequences for those involved. Baseball has also been perpetually threatened by the use of performance-enhancing drugs (PEDs), which many think give players an unfair advantage. Accusations of match-fixing by players and officials have historically plagued sports from baseball and basketball to soccer and boxing. Investigations of the phenomenon of "home-field advantage" have suggested that it is at least partially due to the subjectivity of referees, which seems inherently unfair.

Golf, however, is often considered one of the fairest sports. The benefit of PEDs seems relatively low, and players' salaries are tied directly to their performance, so the incentive to cheat seems small in comparison to other sports. There are also no subjective decisions to be made by officials or referees, so the outcome of tournaments is largely objective and based on player behavior. It was therefore surprising when golfers at the 2018 U.S. Open at Shinnecock Hills in New York complained that they were receiving unfair treatment compared to their competitors. Even though all golfers theoretically play the same course, these irritated players argued that this is not true in practice, since not all players play at the same time. The golfers who played later in the day suggested that the hot weather during the summer tournament dried out the greens, making the course more difficult.
Golf pundits and television commentators have picked up on this idea as well, often referring to courses as "baked out" in the afternoon.

Despite the popularity of this belief, players and pundits alike usually do not offer statistical evidence to back up their claims. I want to take a Bayesian approach to investigate the validity of this popular theory that golf courses become more difficult as the day wears on in hot weather. To do this, I will look at data from the 2018 editions of two of professional golf's four most prestigious tournaments that occur in the late spring and summer, colloquially known as "majors." While the article about the players at Shinnecock Hills concerned play on a Saturday, I will use data from the Thursday and Friday rounds of these tournaments, since the start times, also known as "tee times," for the weekend rounds depend on the tournament standings and thus are likely correlated with skill, which would bias my results. By generating and comparing models for both morning and afternoon tee times, I will hopefully be able to determine whether players truly have a more difficult time later in the day!
This paper suggests that aggregate scores for all players in professional golf tournaments can be appropriately approximated as a Gaussian random variable. Therefore, for each combination of tournament, day of the week, and time of day (morning or afternoon), I want to find a Gaussian posterior with parameters $\mu$ denoting the posterior mean of the scores and $\sigma^2$ denoting the posterior variance. In other words, assuming I only use Thursday and Friday scores for each tournament for the reasons mentioned in the introduction, I will end up with $2 \cdot (2t) = 4t$ Gaussian posteriors, where $t$ is the number of tournaments I analyze. To simplify the computations, and because sources (here, for example) suggest that doing so yields a reasonable approximation for small numbers of samples, I will treat the variance as known and estimate it from the sample standard deviation. Doing this allows me to employ a more manageable "unknown mean, known variance" model, since a Gaussian prior is conjugate to such a likelihood; I am therefore inferring only the mean parameter ($\mu_{post}$). This choice also makes intuitive sense: the variance in my model captures the dispersion of scores between players, which is a measure of the comparative skill of the tournament participants and should not depend on the difficulty of conditions. My idea is that inherent differences between morning and afternoon difficulty will be captured in different expectations (means) for the morning and afternoon distributions, so approximating the variance as known will not hold back the investigation.
I will consider the prior mean for a given tournament to be around par. Another method to determine the prior mean would be to aggregate scores from all tournaments on the PGA Tour, but given time constraints I felt that using par was sufficient. For the prior variance, I referenced previous research in this domain. Again looking at the paper from above, which suggests that the standard deviation for golf scores in their Gaussian model is around 3 shots, I selected a prior variance of $3^2 = 9$.
Since I am analyzing two tournaments (the Masters and the U.S. Open), I will eventually generate $2 \cdot (2 \cdot 2) = 8$ posterior distributions as described above. I then aim to compare these posterior distributions in four pairs; each pair will contain the estimated morning distribution of scores for a certain round and the estimated afternoon distribution for the same round. It is important to notice that my posteriors after this initial analysis will be Gaussians on $\mu_{post}$, the posterior mean. While it is interesting to compare the distributions of the posterior means (and I will do this graphically in my analysis), I am really interested in the distribution of scores, not means. Therefore, I will use the posterior mean to build a posterior predictive distribution, which will show me what scores I expect to see based on my estimate of the mean. To compare the posterior predictives for the morning and afternoon of the same round, I will take an equal number $n$ of samples from each posterior predictive distribution in the pair to generate two vectors of posterior samples, $s_{morning}$ and $s_{afternoon}$. I will then compare these vectors element-wise and count the number of times $m$ that an element of $s_{morning}$ is less than the corresponding element of $s_{afternoon}$. The proportion $m/n$ then estimates the probability that a random morning score is lower than a random afternoon score, which I will use as a proxy for the probability that the afternoon round is more difficult than the morning round.
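The element-wise comparison described above can be sketched in a few lines; this is a minimal illustration assuming NumPy, where the function name and the means/variances of the example draws are placeholders of my own, not fitted values from the tournament data:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_morning_lower(s_morning, s_afternoon):
    """Estimate P(morning score < afternoon score) from paired samples:
    count element-wise wins for the morning vector and divide by n."""
    s_morning = np.asarray(s_morning)
    s_afternoon = np.asarray(s_afternoon)
    m = np.count_nonzero(s_morning < s_afternoon)  # morning strictly lower
    return m / len(s_morning)

# Illustrative posterior-predictive draws; the means and scales here are
# placeholders, not values estimated from real rounds.
s_am = rng.normal(loc=72.0, scale=3.0, size=5000)
s_pm = rng.normal(loc=72.5, scale=3.0, size=5000)
estimate = prob_morning_lower(s_am, s_pm)
```

With a higher afternoon mean, the estimate lands a bit above 0.5, matching the intuition that a random morning score is then more likely to be the lower of the pair.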
As outlined in the formulation section, I chose a Gaussian "unknown mean, known variance" likelihood model. To accompany this likelihood, I chose the conjugate Gaussian prior, which both eases computation and yields a Gaussian posterior. Many sources (such as here, here, and here) derive the updated posterior parameters for the "unknown mean, known variance" Gaussian likelihood with a Gaussian prior, so in the interest of concision I will not repeat that derivation here, but I followed along on my own and confirmed the updated posterior parameters based on the prior and likelihood. Given prior mean $\mu_{0}$ and prior variance $\sigma_{0}^{2}$, "known" variance $\sigma^{2}$, and sample mean $\bar{x}$, the posterior mean $\mu_{0}'$ and posterior variance ${\sigma_{0}^{2}}'$ are:
$$ \begin{aligned} {\sigma_{0}^{2}}' &= \frac{1}{\frac{n}{\sigma^{2}} + \frac{1}{\sigma_{0}^{2}}} \\ \mu_{0}' &= \frac{\frac{n\bar{x}}{\sigma^{2}} + \frac{\mu_{0}}{\sigma_{0}^{2}}}{\frac{n}{\sigma^{2}} + \frac{1}{\sigma_{0}^{2}}} \\ \bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_{i} \end{aligned} $$

As I mentioned before, however, I am more interested in the posterior predictive than the posterior itself. Finding the posterior predictive for the "unknown mean, known variance" Gaussian is another well-established problem, and there are many explanations of the solution, including this well-done video. The video shows that the posterior predictive is simply the marginal density $P(x' \mid x)$, where $x'$ is a potential new observation you want to predict and $x$ is the set of known observations. To evaluate this, you need to evaluate the integral

$$ \int_{-\infty}^{\infty} P(x', \mu_{post} \mid x) \, d\mu_{post} $$
As shown in the video, however, you do not actually need to evaluate the integral directly. Instead, you can recognize that under this model a new observation is the sum of two independent Gaussian random variables (the posterior mean plus the likelihood's observation noise), which can be handled with standard Gaussian identities. In the end, the posterior predictive is a Gaussian with the same mean as the posterior, but with a variance that is the sum of the "known" likelihood variance and the posterior variance.
Once I figured out these parameters for my posterior predictive, I wrote a few functions to evaluate the posterior and posterior predictive at a point and to generate plots of the prior, posteriors, and posterior predictives. These functions proved useful for my next step: comparing samples from the morning and afternoon posterior predictives with MCMC.
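The conjugate update and the posterior-predictive parameters described above can be sketched as follows. This is a minimal illustration assuming NumPy/SciPy; the function names and example scores are my own hypothetical choices, not the notebook's actual code, though the prior (mean = par = 72, variance = 9) follows the text:

```python
import numpy as np
from scipy import stats

def posterior_params(x, mu0, var0, known_var):
    """Conjugate update for a Gaussian likelihood with known variance
    and a Gaussian N(mu0, var0) prior on the mean."""
    n = len(x)
    xbar = np.mean(x)
    post_var = 1.0 / (n / known_var + 1.0 / var0)
    post_mean = post_var * (n * xbar / known_var + mu0 / var0)
    return post_mean, post_var

def predictive_params(post_mean, post_var, known_var):
    """Posterior predictive: same mean as the posterior, variance equal
    to the 'known' likelihood variance plus the posterior variance."""
    return post_mean, known_var + post_var

# Hypothetical morning scores; the 'known' variance is estimated from
# the sample, as described in the formulation section.
scores = np.array([70.0, 73.0, 71.0, 75.0, 72.0, 74.0])
s2 = np.var(scores, ddof=1)
mu_n, var_n = posterior_params(scores, mu0=72.0, var0=9.0, known_var=s2)
mu_pred, var_pred = predictive_params(mu_n, var_n, s2)
predictive = stats.norm(mu_pred, np.sqrt(var_pred))  # frozen density for plots
```

Note that the predictive variance is always wider than the likelihood variance alone, since it adds the remaining uncertainty about the mean.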
I used MCMC code very similar to what we used in class, writing a function that generates samples from the posterior predictive distribution for a given tournament, day, and time of day. Based on this recommendation, which we also discussed in lecture, I tuned the proposal variance for each MCMC run to target an acceptance rate between 23% and 50% for each posterior predictive. Once I had samples for the morning and afternoon rounds of the same day, I wrote a function to estimate the probability that the afternoon scores were higher than the morning scores. This function compares the $i$th elements of the morning and afternoon sample vectors, counts how many afternoon samples are larger than the corresponding morning samples, and divides that count by the total number of samples to estimate the probability that a morning score is lower than an afternoon score.
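A random-walk Metropolis sampler along these lines might look like the sketch below. This is not the class code; the proposal scale and the target's parameters are illustrative placeholders, with the proposal standard deviation chosen by hand to land in the acceptance-rate band mentioned above:

```python
import numpy as np

def metropolis(log_density, x0, n_steps=5000, prop_sd=1.0, rng=None):
    """Random-walk Metropolis sampler. prop_sd is the proposal standard
    deviation, tuned by hand toward a 23%-50% acceptance rate."""
    if rng is None:
        rng = np.random.default_rng(0)
    samples = np.empty(n_steps)
    x = x0
    logp = log_density(x)
    accepted = 0
    for i in range(n_steps):
        proposal = x + rng.normal(scale=prop_sd)
        logp_prop = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x))
        if np.log(rng.uniform()) < logp_prop - logp:
            x, logp = proposal, logp_prop
            accepted += 1
        samples[i] = x
    return samples, accepted / n_steps

# Target: a Gaussian posterior predictive with placeholder parameters
# (mu and var are illustrative, not the fitted tournament values).
mu, var = 72.0, 12.0
log_pred = lambda s: -0.5 * (s - mu) ** 2 / var  # log-density up to a constant
samples, acc_rate = metropolis(log_pred, x0=mu, prop_sd=6.0)
```

Since the target here is log-concave and one-dimensional, a proposal scale of roughly 1.5-2x the target's standard deviation tends to give acceptance rates in the desired band.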
I developed a wide variety of software to come to these solutions, which can be found in the GottesmanFinalProject.ipynb notebook.
Because of the design of my project, I essentially solved eight inference problems (one for the posterior mean of each tournament, day, and time of day). I then used the results of these inference problems to define posterior predictive densities and compared these via MCMC sampling to yield my four "final" results: the probabilities that a random sample afternoon score is greater than a random sample morning score. I have included plots of the posteriors, posterior predictives, MCMC iterations, and sample differences for all the analyses in the GottesmanFinalProject.ipynb notebook and won't reproduce them all here for the sake of readability, but below is an example for the Thursday round at the U.S. Open.
```
Running MCMC...
On step 1000/5000, accept rate = 39.2%
On step 2000/5000, accept rate = 38.8%
On step 3000/5000, accept rate = 38.9%
On step 4000/5000, accept rate = 39.2%
Final Acceptance Rate = 39.4%
Running MCMC...
On step 1000/5000, accept rate = 35.4%
On step 2000/5000, accept rate = 35.7%
On step 3000/5000, accept rate = 36.8%
On step 4000/5000, accept rate = 36.5%
Final Acceptance Rate = 36.5%
```
In terms of my final results themselves, the probabilities are a slightly mixed bag. The following table summarizes the probabilities I derived for each round, which constitute the final results of my analysis. For intermediate results, see the plots contained in `GottesmanFinalProject.ipynb`.
| Tournament | Round | Probability afternoon score is higher |
|---|---|---|
| U.S. Open 2018 | Thursday | 0.5224 |
| U.S. Open 2018 | Friday | 0.4464 |
| Masters 2018 | Thursday | 0.4616 |
| Masters 2018 | Friday | 0.4614 |
A naive way to estimate a single final result might be to average these probabilities, which yields:
$$ \frac{0.5224 + 0.4464 + 0.4616 + 0.4614}{4} = \boxed{0.47295} $$

My final result for the probability that an afternoon score is greater than a morning score across the four summer tournament rounds certainly lays the groundwork for a more detailed analysis of how the difficulty of afternoon major rounds compares to that of morning major rounds in golf. I intended my final estimate of 0.47295 to represent the probability that afternoon major rounds are more difficult than morning major rounds.

Given time constraints, both from the difficulty of collecting my data and from the number of steps in my analysis, I was unfortunately limited to the small sample size of two tournaments. Therefore, I think it would be naive to suggest that this estimate is a definitive answer to my question, but it is certainly a start and worth interpreting. The fact that three out of four probabilities based on individual round data (as well as the final averaged result) suggest that morning rounds are more difficult than afternoon rounds does not tell me for sure that morning rounds are more difficult. It does tell me, however, that afternoon rounds are almost certainly not always significantly more difficult than morning rounds, or else I would not expect to get a majority of results suggesting the opposite.

Furthermore, I might be more confident in my analysis if I relaxed some of the assumptions, such as treating the model variance as "known" and picking my prior parameters based on a single source of previous research. Although I think these methods were reasonable given the time constraints, using an unknown-variance model and adding a hierarchical analysis to estimate my prior mean and variance would make me more confident in my final results. Additionally, I made some assumptions to simplify my analysis based on time constraints that I might want to correct in a subsequent analysis.
These include not factoring weather and temperature into the analysis; not controlling for player skill (perhaps based on previous tournament scores), instead assuming equal player skill across morning and afternoon tee times; assuming conditional independence of observations; and perhaps other assumptions I made implicitly and did not catch.
Ultimately, my hypothesis is that the morning and afternoon rounds are of negligibly different difficulty. In other words, I expect the average probability of the afternoon round being more difficult to converge to nearly 0.5 as more and more rounds are analyzed. While my small sample size does not confirm this hypothesis, it certainly does not disprove it, since I found probabilities on both sides of 0.5 and the final result differed from 0.5 by only about 5%. I would love to continue this analysis with more tournaments, controlling for more assumptions, and potentially with a more complicated model and prior, to see how my hypothesis fares.