Showing blog posts sorted under the tag: julia

Simple Regression Trees in Julia

Being a data analyst, it’s a bit embarrassing how little experience I have with the new hotness of machine learning. I recently had a conversation with an individual who mentioned that they often employ decision and regression trees as a data exploration method and this prompted me to start looking into them.

Decision and regression trees are an awesome tool because of how transparent the end result is. It’s easy to understand and to explain to others who might be weary of implementing something opaque. In the simplest trees, they ask a series of yes-no questions such as: is a certain variable greater than some number. With each question you progress through different paths until you reach a terminal node. This terminal node will give you a prediction, either a classification or a value, depending on the type of tree. This process is extremely easy to follow and is the biggest selling point of decision trees.

Another advantage of decision trees is the simplicity of the algorithm used to create the tree. There are 3 basic steps that go into creating a tree. The first is a calculation on some cost function that we want to minimize. In the case of regression trees the cost function is usually just the mean squared error of all observations at that particular node. Secondly, each variable is iterated over to find the optimal way to divide the observations into two groups. Optimal, in this case, refers to the smallest mean squared error. And finally, once the optimal division is found the process is repeated on the two subgroups. This continues until certain predefined conditions are reached like minimum number of observations at a node.

In fact, the algorithm is so simple I decided to implement a basic regression tree in Julia as a learning exercise. Julia is an awesome statistical computing language thats main advantage is speed. Code written in Julia is often several times faster than the equivalent R or Python code for non-trivial calculations. My implementation is rather limited compared to the ‘rpart’ package in R or even the ‘DecisionTrees.jl’ package available in Julia. The idea was to gain a better understanding of how decision trees actually work and not to replace any of the already great implementations available.

I tested my implementation on the 'cu.summary' dataset from 'rpart'. This dataset contains information on a small number of cars and regressing on mileage gives the following tree:

Price < 9415.84 : 1
  Price < 6696.9 : 2
    4 : 34.0 : 3
    7 : 30.714285714285715 : 3
  Type IN String["Small","Sporty","Compact"] : 2
    Price < 11475.8 : 3
      Reliability IN String["average","Much worse"] : 4
        4 : 27.25 : 5
        6 : 24.166666666666668 : 5
      Reliability IN String["Much worse","better"] : 4
        4 : 21.0 : 5
        7 : 24.428571428571427 : 5
    Type IN String["Medium"] : 3
      Reliability IN String["Much better","worse"] : 4
        6 : 21.333333333333332 : 5
        5 : 22.2 : 5
      6 : 19.333333333333332 : 4

The labels show the decision that is made at each node. The lines that begin with a number show the number of observations that were placed in that bin along with the average mileage of those observations. The output isn’t pretty but it isn’t that difficult to follow since the tree is pretty shallow.

And, as always, I’ve uploaded my code to Github.


Using Julia for Dynamic Charts

The more I play around with Julia the better things seem to get. With web-stuff on my mind lately from working on MiniHTTP I thought it would be cool to create a simple web API for adding charts to any webpage. Thanks to a couple packages (HttpServer.jl and Plots.jl) this was dead easy in Julia.

The basis for handling requests for charts is by creating a simple HTTP Handler. This checks the URL that was supplied and, if it matches, dispatches the correct chart.

http = HttpHandler() do req::Request, res::Response
  if ismatch(r"^/randomchart",req.resource)
    Response(random_chart(), headers("image/png"))
  elseif ismatch(r"^/linechart",req.resource)
    Response(line_chart(req.resource), headers("image/png"))

Above there are two endpoints that we can match. The first one is if the URL starts with ‘/randomchart’, in this case the function ‘random_chart()’ is called which generates some random numbers and plots them to a chart. The chart at the top of this post in made this way and, since it is random, every time this page is refreshed the chart will be slightly different.

The second endpoint is for generating a custom line chart. If the URL starts with ‘/linechart’ then the function ‘line_chart(::AbstractString)’ is called. This function expects that a query string is attached to the end of the supplied URL that contains a list of x and y coordinates to plot on a line graph. For instance, the URL ‘/linechart?x=1,2,3,4,5,6,7,8,9,10&y=1,2,2.5,2,3,1,1.5,3,0.5,2’ was used to generate the plot below.

And that’s really all there is to it. The full code (which is less than 50 lines!) has been uploaded to my Github if you would like to take a further look. With a little extra work, this could be fleshed out to a very nice and easy-to-use web graphing API for people who just want to be able to quickly add some charts to a website.

Update(20-02-2017): I've shut down one of my Amazon EC2 instances so the charts shown above are now static images instead of dynamically generated from Julia.


Objective Gametypes and KDA

Something that often gets said regarding objective gametypes (capture the flag, strongholds, etc.) is that KDA (kill-deaths-assists) does not matter as long as you're getting the objectives. So I wanted to dig through some of my recent Halo 5 matches to see if there's any truth to this statement. Using a sample of recent objective games that I completed, a logistic regression was performed using Kills, Assists, and Deaths on the dependent variable of winning or losing. Basically I want to know how much each of these variables influences the probability of winning or losing a game.

This was also a bit of an excuse to play around some more with the language Julia. I've been starting to use it more at work and am really enjoying what it offers. Although it's tough to compete with the whole R ecosystem, thanks to the RCall package it makes the transition pretty painless. Head over to Julia-Lang if you want to know more.

First the non-Julia stuff. Since I want Halo 5 data I'll be using RCall to interface with my R package to get some data. Below is a simple wrapper function to get the match data I want using my h5api package.

using DataFrames, RCall, GLM

api_key = "itsamystery"
slayer_id = "257a305e-4dd3-41f1-9824-dfe7e8bd59e1"

function getRecentMatches(player, modes, start, count, key)
  R"recent_matches <- getRecentMatches(player = $player,
                                        modes = $modes,
                                        start = $start,
                                        count = $count,
                                        key = $key)"
  r_data = R"cbind(rbindlist(lapply(recent_matches$Results$Players, flatten)),
  return rcopy(DataFrame, r_data)

Next, just using a simple loop I'll grab data from 250 of my most recent matches. CLWakaLaka is my gamertag, so you can either use mine again or try your own if you play Halo 5.

match_data = DataFrame()
for i in 0:9
  match_data = [
    getRecentMatches("CLWakaLaka", "arena", i*25, 25, api_key)

This performs a little cleaning up before performing the regression. Besides a win or a lose, it's possible for a match to end in a tie or disconnect. So first only definite win/lose matches are considered. Slayer games are also filtered out. Since kills are the objective in this game type it goes without saying that KDA directly influences your likelihood of winning. Finally the result is changed to a simple 1-0 variable. 1 meaning win and 0 meaning lose.

match_data = match_data[((match_data[:Result] .== 1) | (match_data[:Result] .== 3)) & (match_data[:id] .!= slayer_id), :]
match_data[:Result] = map(match_data[:Result]) do x
  if x == 3
    return 1
    return 0
match_data[:Result] = convert(Array{Float64, 1}, match_data[:Result])

Finally, using the GLM package, a logistic regression is performed in the data with the following results:

my_lm = glm(Result ~ TotalDeaths + TotalAssists + TotalKills,
            Binomial(), LogitLink())

Formula: Result ~ 1 + TotalDeaths + TotalAssists + TotalKills
               Estimate Std.Error   z value Pr(>|z|)
(Intercept)   0.0405304  0.531324 0.0762819   0.9392
TotalDeaths   -0.202424 0.0572149  -3.53796   0.0004
TotalAssists   0.202192 0.0941433   2.14771   0.0317
TotalKills     0.121854  0.049335   2.46993   0.0135

So what does this tell us? Well the logistic function looks like this: 1 / (1 + exp(-( intercept + b1x1 + b2x2 + etc. ))) Where bi's are the estimates above and xi's are the data points. For example, if I had a game with 10 kills, 4 assists, and 8 deaths, then my estimated probability of winning that game would be: 1 / (1 + exp(-( 0.0405 -0.2028 +0.2024 + 0.122*10 ))) = 0.61 or 61%

From the estimates above: deaths negatively influence the probability of a win and kills and assists influence the probability of a win positively. All three variable estimates have a p-value of less than 0.05 which suggests they are significant factors in the overall outcome of a game (obviously). The intercept, however, is not significant which makes sense since we likely have no data for a 0/0/0 game.

Interestingly, deaths and assists coefficients have roughly equal magnitude while the coefficient for kills is slightly less. This would suggest that the relative importance of these actions corresponds to Deaths = Assists > Kills. Meaning that the most important factors towards getting the win are (in this order): not dying, always shooting/helping teammates, then getting kills.

So there we go, it seems there is a little kernel of truth in the idea that KDA in an objective-based gametype is not everything... Although it certainly helps, and feels so good.