Tennis stats
I came across Tennis-data while browsing at random, and I thought it would make the perfect data set to experiment with in R.
Step 1: data
Download all spreadsheets. Thankfully they all have the same basic structure, one line per match, same columns etc. Most of them have one sheet named the same as the file.
library(gdata)
library(data.table)
years = list.files(pattern='[.]xls')
# plyr's ldply (below) is supposedly the least painful way to do this, but WRank got corrupted when I tried it; that did not happen with data.table
# library(plyr)
# df=ldply(years,read.xls)
dflist = lapply(years,read.xls)
df = rbindlist(dflist, use.names = TRUE, fill = TRUE)
df = as.data.frame(df)
saveRDS(df,file="tennis.data")
Run this with Rscript or simply at an R prompt, and it will go through all the Excel files and generate a nice R data frame saved in a file named tennis.data.
If you import this in R:
df <- readRDS(file="tennis.data")
head(df)
It will look something like this:
| Location | Tournament | Date | Series | Surface | Round | Winner | Loser | WRank | LRank | W1 | L1 | W2 | L2 | W3 | L3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adelaide | AAPT Championships | 2001-01-01 | International | Hard | 1st Round | Clement A. | Gaudenzi A. | 18 | 101 | 6 | 7 | 6 | 0 | 6 | 3 |
| … | | | | | | | | | | | | | | | |
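A quick aside on the `use.names = TRUE, fill = TRUE` part of the `rbindlist` call above. Here is a minimal sketch of what it does when one file has a column another lacks. The two tiny data frames are made up for illustration (only the first winner's name and rank come from the real data), and I'm only assuming that some of the yearly spreadsheets might differ this way.

```r
library(data.table)

# Two made-up "seasons"; the second one has an extra LRank column
y2001 <- data.frame(Winner = "Clement A.", WRank = 18, stringsAsFactors = FALSE)
y2002 <- data.frame(Winner = "Player B.", WRank = 5, LRank = 40, stringsAsFactors = FALSE)

# use.names = TRUE matches columns by name rather than by position;
# fill = TRUE pads columns missing from a table with NA
combined <- as.data.frame(rbindlist(list(y2001, y2002), use.names = TRUE, fill = TRUE))
print(combined)
```

The 2001 row ends up with `NA` in `LRank` instead of the whole bind failing.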
Step 2: play
Right, so now we can load the data in R from the data file instead of looping through the spreadsheets every time.
sink(file="/dev/null")
suppressPackageStartupMessages(library(reshape2))
suppressPackageStartupMessages(library(gdata))
df <- readRDS(file="data/tennis.data")
Now that that’s out of the way, this will reformat the few columns we want to play with here:
df$LRank <- suppressWarnings(as.integer(as.character(df$LRank)))
df$WRank <- suppressWarnings(as.integer(as.character(df$WRank)))
df$Date <- as.POSIXct(df$Date, format='%Y-%m-%d')
df$Year <- format(df$Date, format='%Y')
Right, so now we can finally play with the data!
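But first, a note on why the rank conversion above goes through `as.character`. This is a small sketch, assuming the rank columns were imported as factors and contain a non-numeric placeholder for unranked players. Both are assumptions on my part, and the `rank_raw` values are made up.

```r
# Made-up rank column as it might come out of a spreadsheet import:
# stored as a factor, with a text placeholder for an unranked player
rank_raw <- factor(c("18", "101", "N/A"))

# Converting the factor directly returns its internal level codes, not the ranks
codes <- as.integer(rank_raw)

# Going through character first recovers the numbers; "N/A" becomes NA
# (suppressWarnings hides the "NAs introduced by coercion" warning)
ranks <- suppressWarnings(as.integer(as.character(rank_raw)))

print(codes)
print(ranks)
```

With the columns in the right types, the indicator flags come next.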
# 1 if the match winner also won the first set, 0 otherwise
df$Win1 <- ifelse(df$W1 >= df$L1,1,0)
# 1 if the winner outranks the loser (a lower rank number is better), 0 otherwise
df$Win2 <- ifelse(df$WRank <= df$LRank,1,0)
# 1 if both previous conditions are true, 0 otherwise
df$Win3 <- ifelse(df$Win1==1 & df$Win2==1,1,0)
Now we just need to write a small function to compute stats, and then one to display them. You’ll see the names of the functions are very creative:
# Tally a 0/1 flag column per year and compute the share of 1s
computeStats <- function(data, column) {
  # keep only the year and the flag of interest
  data.subseted = data[,colnames(data) %in% c("Year",column)]
  # count rows per (Year, flag) pair, then spread the flag values into columns "0" and "1"
  result = dcast(as.data.frame(table(data.subseted)), as.formula(paste("Year",column, sep ="~")), value.var="Freq")
  result$total = result$"0"+result$"1"
  result <- data.frame(Losses = result$"0",Wins=result$"1",Total=result$total, row.names=result$Year, stringsAsFactors=FALSE)
  # append a grand-total row before computing the percentages
  result["Total",] = colSums(result)
  result$Pct <- result$Wins/(result$Total)*100
  return(result)
}
# Print the stats as a fixed-width table, framed by separator lines
displayStats <- function(table) {
  message("======================================================")
  write.fwf(format(table, big.mark=",",zero.print=FALSE,trim=TRUE, digits=4),rownames=TRUE)
  message("======================================================")
  # proc.time()["elapsed"] is the wall-clock time since the R process started
  message(paste("It took",format(proc.time()["elapsed"]*1000,digits=5,big.mark=","), "milli-seconds to run this"))
}
Right, so we can display everything now:
message("First set win means match win?")
displayStats(computeStats(df,"Win1"))
message (" ")
message("Winner outranks loser?")
displayStats(computeStats(df,"Win2"))
message (" ")
message("First set win means match win, and winner outranks loser?")
displayStats(computeStats(df,"Win3"))
Step 3: Now what
It looks like winning the first set means winning the match in about 81% of cases, and that holds every year (between 79% and 82%)!
First set win means match win?
======================================================
Losses Wins Total Pct
2001 586 2,465 3,051 80.79
2002 583 2,216 2,799 79.17
2003 532 2,268 2,800 81.00
2004 525 2,345 2,870 81.71
2005 519 2,380 2,899 82.10
2006 564 2,332 2,896 80.52
2007 530 2,284 2,814 81.17
2008 504 2,166 2,670 81.12
2009 551 2,164 2,715 79.71
2010 480 2,183 2,663 81.98
2011 491 2,170 2,661 81.55
2012 483 2,183 2,666 81.88
2013 492 2,086 2,578 80.92
2014 501 2,034 2,535 80.24
2015 512 2,109 2,621 80.47
2016 168 735 903 81.40
Total 8,021 34,120 42,141 80.97
======================================================
It took 1,333 milli-seconds to run this
Is the ATP rank a better predictor? It looks like it isn't: the higher-ranked player wins only about 66% of the time, so this is more of a two-out-of-three type of stat.
Winner outranks loser?
======================================================
Losses Wins Total Pct
2001 1,140 1,906 3,046 62.57
2002 1,026 1,774 2,800 63.36
2003 1,007 1,805 2,812 64.19
2004 1,042 1,829 2,871 63.71
2005 991 1,912 2,903 65.86
2006 1,010 1,889 2,899 65.16
2007 982 1,835 2,817 65.14
2008 895 1,785 2,680 66.60
2009 885 1,840 2,725 67.52
2010 899 1,775 2,674 66.38
2011 874 1,800 2,674 67.31
2012 862 1,814 2,676 67.79
2013 895 1,692 2,587 65.40
2014 823 1,736 2,559 67.84
2015 842 1,779 2,621 67.87
2016 286 620 906 68.43
Total 14,459 27,791 42,250 65.78
======================================================
It took 1,397 milli-seconds to run this
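For what it's worth, the per-year tallies above boil down to a cross-tabulation plus a division. Here is a minimal base-R sketch on made-up data: the `toy` data frame and its values are invented for illustration, and this is not the `computeStats` function used above, just the same idea without reshape2.

```r
# Made-up match flags: 1 = the condition held, 0 = it did not
toy <- data.frame(
  Year = c("2001", "2001", "2001", "2002", "2002"),
  Win  = c(1, 0, 1, 1, 1)
)

# Cross-tabulate year against flag; forcing both levels guarantees
# a "0" column exists even when a year has no losses
tab <- table(toy$Year, factor(toy$Win, levels = c(0, 1)))

stats <- data.frame(
  Losses = as.vector(tab[, "0"]),
  Wins   = as.vector(tab[, "1"]),
  row.names = rownames(tab)
)
stats$Total <- stats$Losses + stats$Wins

# Grand-total row first, percentages second, so the total row gets a Pct too
stats["Total", ] <- colSums(stats)
stats$Pct <- stats$Wins / stats$Total * 100
print(stats)
```

Adding the total row before computing `Pct` is what makes the bottom line a proper overall percentage rather than a sum of yearly percentages.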
Step 4: Well that was underwhelming
Who cares.
Nothing is truly beautiful except that which can serve no purpose; everything useful is ugly, for it is the expression of some need, and man's needs are ignoble and disgusting, like his poor and infirm nature. The most useful place in a house is the latrine. – Théophile Gautier, Mademoiselle de Maupin, preface.