A Year of Job Postings from the Yukon Government

Preamble

Skip ahead to the next section if you just want to see some pictures.

About a year ago, I unfortunately had to turn down a job in Whitehorse with the Yukon Government. So to make sure that I didn't miss any postings in the future, I wrote a short R script to scrape the Yukon's job listings page.

Every morning the script would run on my local Raspberry Pi server and see if there were any new job listings. If there were, it would send me an email with some basic information like title and department. In addition to emails, the script also saved new listings in an SQLite database for preservation. This is also how the script knew that a job post was new. It compared the scraped job postings to those already saved in the database.

Just anecdotally, I noticed the daily emails stopped for a while around April. I guess recruitment was put on hold during the early days of Corona lockdown. Once the emails did resume, it was mainly medical positions, like nurses etc.

Sadly, no suitable position popped up during the past year but, being a data analyst, I didn't want all the data I scraped to go to waste. So I thought it would be interesting to take a look at a few stats about Yukon Government job postings over the past year.

Note: Just to preface these numbers, when I talk about a new job listing what I mean is a job listing with either a new ID or a new closing date. I chose this definition because I noticed that sometimes the same job listing would get re-posted with a new closing date. Also, I think some job listings can be for multiple open positions. So a new posting doesn't necessarily equate to only a single job opening.
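As a rough sketch, the "new listing" check described above could look like this in base R. It uses plain data frames instead of the SQLite database, and the column names `id` and `closing_date` are assumptions made for illustration, not the names used in the actual script.

```r
# Return the scraped postings whose (id, closing date) pair has not been saved yet.
# `id` and `closing_date` are hypothetical column names.
find_new_postings <- function(scraped, saved) {
  key <- function(df) paste(df$id, df$closing_date, sep = "|")
  scraped[!(key(scraped) %in% key(saved)), , drop = FALSE]
}

# Made-up example data: postings already stored from earlier runs
saved <- data.frame(
  id = c("A100", "A101"),
  closing_date = c("2020-03-01", "2020-03-15"),
  stringsAsFactors = FALSE
)

# Made-up example data: postings found in today's scrape
scraped <- data.frame(
  id = c("A100", "A101", "A102"),
  closing_date = c("2020-03-01", "2020-04-15", "2020-04-01"),
  stringsAsFactors = FALSE
)

# A101 counts as new (same ID, new closing date) and so does A102 (new ID)
new_postings <- find_new_postings(scraped, saved)
```

Keying on the ID and closing date together is what makes a re-listed posting show up as new.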

Totals

In total, I collected 440 job postings throughout 2020. I collected listings from late January 2020 until the end of December 2020. That's not quite a full year, but most of the postings from the beginning of January were probably picked up in the initial scrape, so I would say it's roughly the full year.

51 job listings contained 'nurse' or 'RN' in the title, by far the most common. This means over 10% of job listings were for different types of nurses. Become a nurse if you want an easy time finding work in the Yukon.
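A count like this can be sketched with a regular expression over the job titles. The titles below are made up, and the exact pattern is an assumption; the word boundaries around `RN` are there so that titles such as "Intern" are not counted by accident.

```r
# Made-up job titles for illustration
titles <- c(
  "Registered Nurse",
  "Community Nurse Practitioner",
  "RN - Operating Room",
  "Intern",
  "Learning Coordinator",
  "Highway Maintenance Supervisor"
)

# Match "nurse" anywhere, or "RN" as a standalone word (case-insensitive)
is_nurse <- grepl("nurse|\\bRN\\b", titles, ignore.case = TRUE, perl = TRUE)
nurse_count <- sum(is_nurse)

# The share quoted in the post: 51 nurse postings out of 440 in total
nurse_share_pct <- round(100 * 51 / 440, 1)
```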

36 job postings were re-listings (same ID but different closing date). I don't know the specifics of why each posting was re-listed; there could be a lot of reasons. However, one reason might be that the position could not be filled, which may mean that these are difficult-to-fill positions.

Of these 36 re-postings, 7 (almost 20%!) were nurses or RNs. Did I already mention you should become a nurse if you want to find work in the Yukon? Several others were generally high-level positions like directors, managers, and supervisors. And some were more specialized jobs like Infection Control Coordinator.

Cities


It shouldn't be much of a surprise that the vast majority of job postings are for positions in Whitehorse, followed by Dawson City at a distant second.

Job postings that contained multiple locations got counted for each city. So the percentages shown add up to more than 100%. There were quite a few job postings applicable to more than one location.
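To see why the percentages add up to more than 100%, here is a minimal sketch with made-up postings. The semicolon separator between locations is an assumption; the scraped data may store multiple locations differently.

```r
# Made-up location fields, one per posting; some list more than one city
locations <- c(
  "Whitehorse",
  "Whitehorse; Dawson City",
  "Dawson City",
  "Whitehorse; Watson Lake",
  "Whitehorse"
)

# Split each field into its cities, then count every city once per posting
per_posting <- strsplit(locations, ";\\s*")
city_counts <- table(unlist(per_posting))

# Percentages are relative to the number of postings, not the number of city mentions
city_pct <- 100 * city_counts / length(locations)
```

With these five postings, Whitehorse appears in four (80%), Dawson City in two (40%), and Watson Lake in one (20%), for a total of 140%.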

Departments


Here we see the number of job postings aggregated by department. I left out sub-departments because there are too many and they would clutter the chart.

Similar to locations, departments are also quite skewed towards a few big ones. With Corona putting a freeze on a lot of hiring, then resuming only health-related recruitment, it makes sense that the Health and Social Services department had the most job postings in the last year. Highways and Public Works also had quite a few open positions this year. Education having so few job postings surprised me a little. I thought it would have been higher.

It would be very interesting to see how this compares to previous non-Coronavirus years.

All Job Postings

Just for fun, below is a sunburst plot with all the job postings I collected, grouped by department and then job title. Try clicking on a department to view all the job postings under it over the past year.

Data and Scripts

For anybody interested in the data I collected, or the scripts I used, visit the project on my GitHub. I have the job postings there in CSV format as well if somebody wants to take a look themselves.

One thing I didn't collect (that I really should have, had I had more forethought) was the day each job posting was listed. It would have been cool to see recruitment in different departments change drastically week-by-week during the lockdown.

Thanks for reading!




Historical Linux Statistics from Steam's Hardware & Software Survey

Steam is a popular gaming storefront and platform on Windows, macOS, and Linux. Every month they publish their Hardware & Software Survey with overall summary statistics of their users. The data can also be broken down by operating system.

I wanted to see how the share of Linux has changed over time on Steam. Unfortunately, the survey data is only ever available for the previous month. So I wrote a small R script (GitHub link) to scrape historic survey results from the Internet Archive's Wayback Machine and current data directly from Valve. A few months were missing from the Wayback Machine, which was a bummer, but enough data was available to get a feel for how the metrics have changed over time.
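For illustration only, here is how a list of Wayback Machine snapshot URLs could be built in base R. It relies on the Wayback Machine's `web/<timestamp>/<url>` address scheme, which serves the capture closest to the requested timestamp. The survey address and the date range are assumptions, and this is not necessarily how the linked script does it.

```r
# Assumed address of the survey page
survey_url <- "https://store.steampowered.com/hwsurvey"

# One request per month over an arbitrary example range
months <- seq(as.Date("2017-01-01"), as.Date("2017-06-01"), by = "month")

# Wayback timestamps use the form YYYYMMDDhhmmss
timestamps <- paste0(format(months, "%Y%m%d"), "000000")

snapshot_urls <- paste0("https://web.archive.org/web/", timestamps, "/", survey_url)
```

Each URL could then be downloaded and parsed; months without a nearby capture would have to be detected and skipped.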

Steam Survey Linux User Percentage for this month

Linux User Percentage

The drop in Linux user share in late 2017 to early 2018 was due to a combination of factors. First was a counting error in the survey that Valve admitted to and later fixed. The error resulted in over-inflated user numbers from net cafes. Additionally, this time period was the peak player count for the hugely successful title PlayerUnknown's Battlegrounds. PUBG brought a lot of new players to the Steam platform from regions where net cafes are a popular way to play games. Both of these factors combined to substantially deflate the Linux user share on Steam.

Because of the overall growth of Steam, however, a drop in Linux share does not necessarily mean an absolute drop in Linux players. In the Steam 2019 Year in Review, they mention a monthly active user count of almost 95 million. That equates to about 850k monthly Linux players during 2019.
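As a quick sanity check on those two figures, the Linux share they imply can be worked out directly. Both inputs are the rounded numbers quoted above, so the result is only approximate.

```r
monthly_active_users <- 95e6   # "almost 95 million" monthly active users
linux_players <- 850e3         # "about 850k" monthly Linux players

# Share of Linux players implied by the two rounded figures, in percent
implied_share_pct <- 100 * linux_players / monthly_active_users
round(implied_share_pct, 2)
# 0.89
```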

Steam Survey Linux User Percentage

Linux User Percentage by OS Language

Restricting to reported OS language shows some interesting results with regards to the Linux share on Steam. The percentage of Linux users, among those with an English-language OS, is around twice that of the general population. It's unclear whether this difference is due to more English speakers preferring Linux, or more Linux users preferring English. But it's a surprising difference nonetheless.

Steam Survey Linux User Percentage by OS Language

Processor Preference of Linux Users

Since the release of AMD's Ryzen CPU line in early 2017, more and more Linux users have been forgoing Intel processors in favour of AMD. However, Intel still has a clear market lead.

Steam Survey Linux User Processor Vendor Usage

GPU Vendor Preference of Linux Users

AMD is gaining ground in the GPU space among Linux users on Steam. The results of AMD's open-source initiative began to bear fruit in 2017/18 as game performance approached some of Nvidia's offerings.

Despite a closed-source video driver, Nvidia still remains the main choice among Linux users.

Steam Survey Linux User GPU Vendor Usage

Most Popular Linux Distros

Steam (unfortunately) does not report many different Linux distributions, preferring to group most in the 'Other' category.

Ubuntu remains the most popular Linux distribution on Steam, and many Linux games specifically target Ubuntu as a supported OS. As a result, it generally offers the smoothest experience for new users.

Outside of Ubuntu, there is a great variety of Linux distributions, many of which will also have no issues running games on Steam.

Steam Survey Linux Top Distros

In addition to the snapshot of data above, I've set up a page on my Linux gaming blog with the same charts, which are updated automatically by an R script whenever new data becomes available.




Overlaying Frames-per-Second on a Benchmark Video Using R, ffmpeg, and Kdenlive


Feral Interactive is a UK-based porting house that specializes in bringing Windows games to other platforms like Linux and macOS. One of their most recent projects was bringing the game Shadow of the Tomb Raider to Linux. I wanted to compare the performance of their native Linux version of the game versus running it in Linux using a popular compatibility layer called Wine. Running games on Linux with Wine often incurs some performance cost compared to Windows, so there is still a market for native Linux ports that can recover some of that lost performance.

Conveniently, Shadow of the Tomb Raider contains a built-in benchmark tool that writes its results to a text file, which can then be analyzed with R. The raw data looks a little like this:

  frame  time delta memory
  <int> <dbl> <dbl>  <dbl>
1     1   0     0     2341
2     2  14.4  14.4   4462
3     3  35.7  21.3   4462
4     4  53    17.3   4462
5     5  72.1  19.1   4462
6     6  91.6  19.5   4462

Frame is the id of the current frame, time is the milliseconds since the start of the benchmark, and delta is the amount of time it took to draw the frame. Most gamers don't really care about these numbers though; the most relatable metric is frames-per-second, which is the number of frames that can be drawn in one second. To calculate this, I just look at the time it took to draw the previous 50 frames; 50 divided by that time (converted from milliseconds to seconds) is the rolling FPS.
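The rolling calculation can be sketched in base R as follows. The function name `rolling_fps` is hypothetical, and the input is a synthetic benchmark with a constant 20 ms per frame rather than real game data.

```r
# Rolling FPS over the previous `window` frames.
# time_ms: cumulative time in milliseconds at each frame (the `time` column).
rolling_fps <- function(time_ms, window = 50) {
  n <- length(time_ms)
  fps <- rep(NA_real_, n)  # the first `window` frames have no full window yet
  if (n > window) {
    idx <- (window + 1):n
    elapsed_s <- (time_ms[idx] - time_ms[idx - window]) / 1000
    fps[idx] <- window / elapsed_s
  }
  fps
}

# Synthetic example: 200 frames drawn at a steady 20 ms each
time_ms <- (0:199) * 20
fps <- rolling_fps(time_ms)
```

At a steady 20 ms per frame, every full window of 50 frames spans exactly one second, so the rolling value is 50 FPS throughout.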

With FPS calculated, it's easy to use R and ggplot2 to make a nice graph showing the performance of the benchmark over time.


That's neat, but what I really wanted was to overlay the chart on footage of the actual benchmark so that people could see how different in-game scenes affect the frames-per-second. To do this I used a few tools: R again for the chart generation, ffmpeg to turn pictures into a video, and then Kdenlive to edit the video.

Generating Charts:

To embed a moving chart in a video, I used R and ggplot2 to generate 1 chart per video frame. That works out to 4000 individual charts, since the benchmark is 160 seconds long and I wanted 25 frames per second. Each chart shows a 10-second window of data across the whole image, and each new frame shifts that window forward by 1/25th of a second.
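The bookkeeping for the sliding window might look something like this. The helper `chart_window` is hypothetical, and whether the window trails the current moment (as here) or is centred on it is an assumption. The file names follow the `plots/fps_%d.png` pattern used by the ffmpeg command in the next step.

```r
# Time range (in seconds) shown by the chart for a given video frame.
# Assumes the window ends at the current moment and looks back `window_s` seconds.
chart_window <- function(frame_index, video_fps = 25, window_s = 10) {
  end_s <- frame_index / video_fps
  c(start_s = end_s - window_s, end_s = end_s)
}

# 160 seconds of benchmark at 25 frames per second
n_frames <- 160 * 25

# One output file per frame, numbered from 1
plot_files <- sprintf("plots/fps_%d.png", seq_len(n_frames))

first_window <- chart_window(1)
last_window <- chart_window(n_frames)
```

Inside a loop over `seq_len(n_frames)`, each chart would then be limited to its window on the x-axis and saved to the matching file.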

To make things look a bit nicer in the final video, the background of the charts had to be a colour that could easily be chroma keyed out. Chroma keying can remove a certain colour from a video layer, basically green screening. So all 4000 charts looked something like the following beautiful image.


Turning Charts into a Video:

Thankfully, turning a series of images into a video is a rather common problem and there are a lot of examples online of using ffmpeg to do this conversion. So I shamelessly borrowed the following command to turn all 4000 charts into a video. I won't pretend to know what all of the arguments do, but importantly it is set to 25 frames-per-second to match the timing of the generated charts. Without this the scrolling chart would be too fast or too slow and would not line up with the benchmark footage.

ffmpeg -r 25 -f image2 -start_number 1 -i plots/fps_%d.png -vcodec libx264 -profile:v high444 -crf 0  -pix_fmt yuv420p sottr_fps.mp4

Overlay FPS Video on top of Benchmark:

Kdenlive is an open-source video editor for Linux. Video editing is one of the areas of desktop Linux that is still a bit lacking, but Kdenlive crucially has a chroma key feature, which is the key component in this step. The video generated from the bright green charts is overlaid on the footage of the Tomb Raider benchmark, then the chroma key is applied.

In this screenshot you can see the chroma key effect being applied to the bright green of the chart video. It removes the background and turns it into a very nice looking overlay.


So that's it. I really enjoyed this little project because it was the combination of several tools (R, ffmpeg, and Kdenlive) that really made it possible. Each had a specific task and it all came together nicely.

Check out the final result on YouTube.




Bitcoin Prices and Hidden Markov Models

Lately, there’s been a lot of interest in Bitcoin, probably sparked by its almost unbelievable growth in December 2017. However, this past week, we saw the price of Bitcoin drop to just above $6,000, which was the lowest it has been since November 2017. So I wanted to take a closer look at Bitcoin prices through the lens of Hidden Markov Models (HMM) to see what conclusions, if any, can be drawn.

Hidden Markov Models are similar to a standard Markov chain model, except that the current state is unknown. Instead of observing the actual state of the process, the only information available is the realization of some other output that is dependent on the current internal state. A somewhat contrived example would be trying to detect whether it is raining, or not, based on how many people you see with umbrellas. The hidden, unobservable state is the weather (raining or not) while the observable realization of that state is the proportion of people carrying umbrellas (more people carry umbrellas if it’s raining).

Applying this concept to Bitcoin prices, there could be some internal state driving the change in price, with different states producing different expected price changes. I assumed that the daily change in price followed a Log-normal distribution, which means that the log of the daily returns should be normally distributed. This made the model slightly easier to interpret. I also used 3 internal states in an attempt to capture bear and bull states with differing volatility.
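For reference, log returns can be computed from a series of closing prices in one line of base R. The prices below are made up for illustration.

```r
# Made-up daily closing prices
close_price <- c(100, 110, 99, 99)

# Log return for each day: log(P_t / P_{t-1})
log_returns <- diff(log(close_price))

# Log returns add up over time, so their sum is the log of the total price change
total_log_return <- sum(log_returns)
```

If the daily price ratio is Log-normal, these log returns are normally distributed, which is what lets each hidden state be described by a simple mean and standard deviation.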

Below is a chart showing the most likely states during the 2017 and into the 2018 calendar years:


Here each of the three states is coloured. The blue state was characterized by positive average returns and low volatility. The red state also had generally positive returns but higher volatility. Finally, the green state had mostly negative returns and also high volatility.

I also ran a quick Shapiro-Wilk test on the log-valued daily returns, which was unable to reject the null hypothesis that they come from a normal distribution. This means that there wasn’t enough evidence to disprove the assumption that price changes follow a Log-normal distribution.
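A test like this is available in base R as `shapiro.test()`. The sketch below runs it on simulated data rather than real Bitcoin prices; the sample size and the Log-normal parameters are arbitrary assumptions.

```r
set.seed(42)

# Simulated daily price ratios drawn from a Log-normal distribution
price_ratio <- rlnorm(250, meanlog = 0.001, sdlog = 0.04)

# Their logs should look normally distributed
log_returns <- log(price_ratio)

# Null hypothesis: the sample comes from a normal distribution
sw <- shapiro.test(log_returns)
sw$p.value
```

A large p-value means the test cannot reject normality. That is not the same as proving the returns are normal; it only says the data do not contradict the assumption.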

This is all fine and good, but what would be really cool is if the fitted model could be used to predict the future price of Bitcoin. So I ran 10,000 30-day simulations to get an expected future price and a confidence interval. This is what it looks like:
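To give an idea of what such a simulation involves, here is a minimal base R sketch of simulating price paths from a 3-state HMM with normally distributed log returns. Every number in it (transition matrix, state means and volatilities, starting price and state) is invented for illustration and is not taken from the fitted model. The function name `simulate_hmm_paths` is likewise hypothetical.

```r
# Simulate price paths from a hidden Markov model with Gaussian log returns.
# trans: state transition matrix (rows sum to 1); mu/sigma: per-state
# mean and standard deviation of the daily log return.
simulate_hmm_paths <- function(n_sims, n_days, start_price, start_state,
                               trans, mu, sigma) {
  n_states <- length(mu)
  paths <- matrix(NA_real_, nrow = n_sims, ncol = n_days)
  for (s in seq_len(n_sims)) {
    state <- start_state
    log_price <- log(start_price)
    for (d in seq_len(n_days)) {
      # Move to the next hidden state, then draw that state's log return
      state <- sample.int(n_states, size = 1, prob = trans[state, ])
      log_price <- log_price + rnorm(1, mean = mu[state], sd = sigma[state])
      paths[s, d] <- exp(log_price)
    }
  }
  paths
}

set.seed(1)

# Invented parameters: calm/positive, volatile/positive, volatile/negative
trans <- matrix(c(0.90, 0.05, 0.05,
                  0.10, 0.80, 0.10,
                  0.05, 0.15, 0.80), nrow = 3, byrow = TRUE)
mu <- c(0.004, 0.010, -0.012)
sigma <- c(0.02, 0.06, 0.07)

paths <- simulate_hmm_paths(n_sims = 10000, n_days = 30, start_price = 6000,
                            start_state = 3, trans = trans, mu = mu, sigma = sigma)

# Expected price per day, plus the bounds of the 95% and 80% bands
expected_price <- colMeans(paths)
bands <- apply(paths, 2, quantile, probs = c(0.025, 0.10, 0.90, 0.975))
```

The bands here are simply quantiles of the simulated prices on each day, which is one straightforward way to turn the simulations into shaded regions on a chart.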


This shows the predicted Bitcoin price, and the actual price change during the prediction interval. The shaded regions also represent the 95% and 80% confidence intervals, based on the 10,000 simulations. In this instance, the HMM was not exactly a great predictor. Bitcoin has been incredibly volatile and I think it’s extremely difficult to make any meaningful predictions using closing price alone.

If you’re interested in taking a closer look at the R code used to fit the HMM and generate the charts, you can find it on my GitHub.




Canadian 2016 Census - Population and Dwellings

View everything here

About a month ago, Statistics Canada finally started releasing summary statistics for the 2016 census. The long-form census was re-introduced last year so over the course of this year there should be lots of interesting data to look through. As of right now, only information on population and dwelling counts has been released with age and sex demographics scheduled for the beginning of May 2017.

I wanted to play around with a couple of new tools like leaflet and highcharts, and the census population data was the perfect test dataset. Leaflet is an awesome mapping library that feels really snappy in a browser, and the R wrapper is incredibly simple to use. I definitely recommend it for any type of geographic visualization. Flexdashboard was used to create a single-page HTML file that I then hosted on my webserver.

I don't have much to say about the data since I'm not really in a position to draw any conclusions. It's mostly just interesting to see how things have changed in Canada over the past 5 years.

