Tennis stats

I came across Tennis-data while browsing at random and thought it would make the perfect data set for experimenting with R.

Step 1: data

Download all the spreadsheets. Thankfully they all share the same basic structure: one row per match, with the same columns. Most of them contain a single sheet named after the file.

library(gdata)
library(data.table)
years = list.files(pattern='[.]xls')

# ldply is supposedly the least painful way to do this, but WRank got corrupted when I tried it. That did not happen with data.table.
# library(plyr)
# df=ldply(years,read.xls)

dflist = lapply(years,read.xls)
df = rbindlist(dflist, use.names = TRUE, fill = TRUE)
df = as.data.frame(df)

saveRDS(df,file="tennis.data")

Run this with Rscript or simply at an R prompt, and it will go through all the Excel files and generate a nice R data frame saved in a file named tennis.data.
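
To see what use.names and fill do in the rbindlist call, here is a minimal sketch on two made-up tables. The player names and values are invented for illustration, and the idea that some years carry extra columns is an assumption, not something checked against the real spreadsheets.

```r
library(data.table)

# Two toy "spreadsheets": the second has an extra column and a different column order.
y1 <- data.table(Winner = "Player A", Loser = "Player B", WRank = 18L)
y2 <- data.table(Loser = "Player D", Winner = "Player C", WRank = 5L, LRank = 40L)

# use.names = TRUE matches columns by name rather than by position;
# fill = TRUE pads columns that are missing from a table with NA.
combined <- rbindlist(list(y1, y2), use.names = TRUE, fill = TRUE)
print(combined)
```

The first row ends up with NA in LRank instead of the bind failing, which is presumably why the script asks for fill.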

If you load the file back into R:

df <- readRDS(file="tennis.data")
head(df)

It will look something like this:

Location	Tournament	Date	Series	Surface	Round	Winner	Loser	WRank	LRank	W1	L1	W2	L2	W3	L3
Adelaide	AAPT Championships	2001-01-01	International	Hard	1st Round	Clement A.	Gaudenzi A.	18	101	6	7	6	0	6	3
…
…

Step 2: play

Right, so now we can load the data into R from the saved file instead of looping through the spreadsheets every time.

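# discard regular output (messages still show); call sink() to restore it
sink(file="/dev/null")
suppressPackageStartupMessages(library(reshape2))
suppressPackageStartupMessages(library(gdata))
# step 1 saved tennis.data in the working directory; adjust this path to wherever you put the file
df <- readRDS(file="data/tennis.data")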

Now that that’s out of the way, this will reformat the few columns we want to play with here:

df$LRank <- suppressWarnings(as.integer(as.character(df$LRank)))
df$WRank <- suppressWarnings(as.integer(as.character(df$WRank)))
df$Date <- as.POSIXct(df$Date, format='%Y-%m-%d')
df$Year <- format(df$Date, format='%Y')
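
The as.integer(as.character(...)) round trip matters if the rank columns were read as factors. A small sketch of why, assuming a hypothetical non-numeric entry ("NR" here is made up) sits among the ranks:

```r
# Hypothetical rank column read as a factor, with one non-numeric entry.
rank <- factor(c("18", "101", "NR", "5"))

# Converting the factor directly returns the internal level codes, not the ranks.
codes <- as.integer(rank)

# Going through character first recovers the numbers; "NR" becomes NA
# (with a coercion warning, hence suppressWarnings).
ranks <- suppressWarnings(as.integer(as.character(rank)))
print(ranks)

# The year is then just a formatted view of the parsed date.
d <- as.POSIXct("2001-01-01", format = "%Y-%m-%d")
year <- format(d, format = "%Y")
print(year)
```

Note that Year comes out as a character string, which is fine for grouping.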

Right, so now we can finally play with the data!

# 1 if the match winner also took the first set, otherwise 0
df$Win1 <- ifelse(df$W1 >= df$L1,1,0)
# 1 if the winner was ranked at least as high as the loser (a lower number is a better rank)
df$Win2 <- ifelse(df$WRank <= df$LRank,1,0)
# 1 if both previous conditions hold
df$Win3 <- ifelse(df$Win1==1 & df$Win2==1,1,0)
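
One thing worth knowing about these indicators: a missing score or rank makes ifelse return NA, and table() leaves NA out by default. A sketch on invented data (the toy data frame is hypothetical):

```r
# Toy matches: the third one has no rank for the loser.
toy <- data.frame(WRank = c(3L, 50L, 10L), LRank = c(20L, 7L, NA))
toy$Win2 <- ifelse(toy$WRank <= toy$LRank, 1, 0)
print(toy$Win2)

# By default table() drops the NA, so only two matches are counted.
counted <- table(toy$Win2)
print(counted)

# Asking for NA explicitly shows all three.
with_na <- table(toy$Win2, useNA = "ifany")
print(with_na)
```

This is a plausible reason, though not a verified one, why per-indicator totals can differ slightly from each other and from the number of rows in df.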

Now we just need to write a small function to compute the stats, and then one to display them. You’ll see the names of the functions are very creative:

computeStats <- function(data, column) {
    # keep only the year and the 0/1 indicator column
    data.subseted = data[,colnames(data) %in% c("Year",column)]
    # count rows per (Year, indicator) pair, then spread into one row per year with columns "0" and "1"
    result = dcast(as.data.frame(table(data.subseted)), as.formula(paste("Year",column, sep ="~")), value.var="Freq")
    result$total = result$"0"+result$"1"
    result <- data.frame(Losses = result$"0",Wins=result$"1",Total=result$total, row.names=result$Year, stringsAsFactors=F)
    # append a grand-total row, then compute the win percentage for every row
    result["Total",] = colSums(result)
    result$Pct <- result$Wins/(result$Total)*100
    return(result)
}
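
For comparison, the same per-year table can be sketched in base R without reshape2. This is an alternative under stated assumptions, not the function used above: yearlyRate and the toy data are made-up names, and the sketch assumes both outcomes ("0" and "1") occur at least once.

```r
# Base-R sketch: count 0/1 outcomes per year with a two-way table.
yearlyRate <- function(data, column) {
    counts <- table(data$Year, data[[column]])   # rows: years, columns: "0" and "1"
    result <- data.frame(Losses = as.vector(counts[, "0"]),
                         Wins   = as.vector(counts[, "1"]),
                         row.names = rownames(counts))
    result$Total <- result$Losses + result$Wins
    result["Total", ] <- colSums(result)         # grand-total row
    result$Pct <- result$Wins / result$Total * 100
    result
}

# Invented data: three matches in 2001, two in 2002.
toy <- data.frame(Year = c("2001", "2001", "2001", "2002", "2002"),
                  Win1 = c(1, 1, 0, 1, 0))
stats <- yearlyRate(toy, "Win1")
print(stats)
```

The two-way table already has one row per year, so no reshaping step is needed here.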

displayStats <- function(table) {
    message("======================================================")
    # write.fwf (from gdata) prints the data frame as fixed-width text
    write.fwf(format(table, big.mark=",",zero.print=FALSE,trim=TRUE, digits=4),rownames=TRUE)
    message("======================================================")
    # note: proc.time()["elapsed"] is the time since the R process started, not the time spent in this function
    message(paste("It took",format(proc.time()["elapsed"]*1000,digits=5,big.mark=","), "milli-seconds to run this"))
}
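
Since proc.time() counts from the start of the R process, the reported figure includes startup and loading the data. If you only want the cost of one computation, system.time() is the usual tool. A minimal sketch, with a throwaway calculation standing in for the real work:

```r
# Time only the expression of interest.
timing <- system.time({
    x <- sum(sqrt(seq_len(1e6)))   # placeholder workload, not part of the tennis script
})

# system.time returns user, system and elapsed seconds; convert elapsed to milliseconds.
elapsed_ms <- timing[["elapsed"]] * 1000
message("It took ", format(elapsed_ms, digits = 5, big.mark = ","), " milliseconds")
```

For a script run once with Rscript the difference is small, which is probably why the simpler proc.time() call was good enough here.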

Right, so we can display everything now:

message("First set win means match win?")
displayStats(computeStats(df,"Win1"))

message("  ")
message("Winner outranks loser?")
displayStats(computeStats(df,"Win2"))

message("  ")
message("First set win means match win, and winner outranks loser?")
displayStats(computeStats(df,"Win3"))

Step 3: Now what

It looks like the player who wins the first set goes on to win the match in about 81% of cases, and that figure barely moves from year to year!

First set win means match win?
======================================================
Losses Wins Total Pct
586   2,465  3,051  80.79
583   2,216  2,799  79.17
532   2,268  2,800  81.00
525   2,345  2,870  81.71
519   2,380  2,899  82.10
564   2,332  2,896  80.52
530   2,284  2,814  81.17
504   2,166  2,670  81.12
551   2,164  2,715  79.71
480   2,183  2,663  81.98
491   2,170  2,661  81.55
483   2,183  2,666  81.88
492   2,086  2,578  80.92
501   2,034  2,535  80.24
512   2,109  2,621  80.47
168   735    903    81.40
Total 8,021 34,120 42,141 80.97
======================================================
It took 1,333 milli-seconds to run this

Is the ATP rank a better predictor? It looks like it is not: the higher-ranked player wins about 66% of the time, so this is more of a two-out-of-three kind of stat.

Winner outranks loser?
======================================================
Losses Wins Total Pct
1,140  1,906  3,046  62.57
1,026  1,774  2,800  63.36
1,007  1,805  2,812  64.19
1,042  1,829  2,871  63.71
991    1,912  2,903  65.86
1,010  1,889  2,899  65.16
982    1,835  2,817  65.14
895    1,785  2,680  66.60
885    1,840  2,725  67.52
899    1,775  2,674  66.38
874    1,800  2,674  67.31
862    1,814  2,676  67.79
895    1,692  2,587  65.40
823    1,736  2,559  67.84
842    1,779  2,621  67.87
286    620    906    68.43
Total 14,459 27,791 42,250 65.78
======================================================
It took 1,397 milli-seconds to run this
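
As a quick sanity check, the two overall percentages can be recomputed from the Total rows printed above. The helper name pct is made up for this sketch.

```r
# Totals copied from the two tables above.
first_set <- c(Losses = 8021, Wins = 34120)
ranking   <- c(Losses = 14459, Wins = 27791)

# Share of wins as a percentage, rounded to two decimals.
pct <- function(x) unname(round(x["Wins"] / sum(x) * 100, 2))

first_set_pct <- pct(first_set)
ranking_pct   <- pct(ranking)
print(c(first_set = first_set_pct, ranking = ranking_pct))
```

Both values agree with the Pct column of the Total rows, and the gap between them is about 15 percentage points.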

Step 4: Well that was underwhelming

Who cares.

Nothing is truly beautiful except that which can serve no purpose; everything useful is ugly, for it is the expression of some need, and the needs of man are ignoble and disgusting, like his poor and infirm nature. The most useful place in a house is the latrine. – Théophile Gautier, Mademoiselle de Maupin, preface.