Installing VMWare Player 15.5 on Ubuntu

Published: 2020-01-19 1 minute read

VMWare Player has a bug during the installation process, where you get through accepting the license agreement but pressing the Finish button doesn't complete the installation - it just does nothing. Several results on Google implied that it was something to do with needing root privileges, but the exact solutions didn't work for me.

When I used sudo vmplayer I experienced the same bug, but after becoming the root user with sudo su and then running vmplayer finally completed the install. I'm not really sure what's going on there. Afterwards I could run VMWare Player as my non-privileged user and it worked fine.

Lessons learned while adopting Snakemake

Published: 2020-01-12 4 minute read

Recently, I decided to adopt the Snakemake workflow system, as I had started to hit some pain points in my project that it reportedly solves. While I believe that this was ultimately a great decision in the long term, it did take me a week to fully move things over when I had expected it to just take a day. Much of this work was adapting my code to the framework and was really only necessary because I was not adhering to the Unix philosophy in the first place, which is entirely my fault, but Snakemake fortunately incentivizes this. That said, it was very frustrating at points, which I think is a combination of the inherent complexity of the problem it's solving, some documentation issues, and unhelpful error messages. I decided to describe these problems and how I solved them.

Input referenced a variable that wasn't referenced by the output

This was the most conceptually surprising part of the framework for me. I had assumed that if you had a chain of rules, that one rule's input could reference the output of the previous rule using a pattern. However, this is only true if any variables that the input section has are also used in the output! Even when directly using rules.some_rule.output as the input, Snakemake couldn't figure out what I meant, since it had a variable {run} that I wasn't using in the output filename. The solution was to refactor how I was building up the file paths, so that the unique thing that identifies each dataset is the only string represented by the variable.

A pseudo-rule that lists every file you want to make is required

One of the more confusing aspects of Snakemake to me (which effectively explains the previous issue) is that while it can use globbing to infer all the inputs to a rule, it can only do that because the outputs are explicitly defined my subsequent rules, with an all-encompassing pseudo-rule (usually called rule all) at the very end. My mental framework going in was working in the opposite direction and assumed it was just inferring what work was necessary, but this isn't the case. I feel like this should be a solvable problem but I have a gut feeling it runs into some sort of halting problem issues.

An input is not necessary

Virtually every example rule in the documentation has an input, but this isn't strictly necessary. At the beginning of my pipeline I take sequencing data and merge the paired end reads, and while the script that's run does take actual inputs, there was no reason to reference them.

Referencing a directory that contains outputs as an input will cause Snakemake to rerun the pipeline unnecessarily

I had mistakenly referenced an output directory as an input to a rule when it was completely unnecessary (it was actually the input mentioned in the previous section that wasn't necessary at all). Because files were being written to that directory, the last-modified timestamp of the directory was being updated, so it always appeared to Snakemake that the earlier rule had a fresher input than the rest of the pipeline. Unfortunately, this was also my very first rule, so the entire pipeline was running from scratch even after it had completed, which took several hours. Although this is not as likely to happen to other people, it's definitely what motivated this post.

Script sections referenced variables that are surprisingly out of scope

I had rules with a script: section, and I tried to reference a variable called {run} that was used in the input and output sections. The solutions is simply to use {}, but the error message doesn't make this clear. This is in the documentation so I was fundamentally the problem there.

Not escaping Python formatting properly

When using .format() to build up strings in my Snakefile, I didn't realize that you had to escape the Snakemake variables with double curly braces. This has nothing to do with Snakemake, I was just not too familiar with how the built-in format() function worked. However, the error message didn't make it clear that this was a Python syntax issue and not a Snakemake issue.


I feel that there's probably a better way to present the logic of Snakemake to new users such that they won't encounter these errors, and in some cases the error messaging might be able to be improved. Overall, however, this is so much better than GNU Make for a host of reasons, and is orders of magnitude better than not having any build system at all. Once I get enough experience with it I might write a tutorial or propose changes to the official documentation.

A temporary fix to several audio bugs in Ubuntu

Published: 2018-03-18 1 minute read

There are two bugs affecting my computer that prevent audio from playing:

  1. If I boot my Ubuntu laptop with headphones plugged in, no audio will play while the headphones are plugged in.
  2. Firefox 58 and 59 will not play audio unless another program produced a sound before Firefox was started.

I finally found a combination of hacks that resolves these issues, though this is clearly only for the kind of sociopath that continues to use Linux under such conditions.

Install mpg123: sudo apt install mpg123

Download this 50 ms inaudible mp3 I made: curl -o ~/Documents/very-quiet-noise.mp3

Then add this to ~/.profile:

# Plays a sound at boot so Firefox 58+ will be able to play audio.
mpg123 -q  ~/Documents/very-quiet-noise.mp3

# If we booted with headphones plugged in, this will allow audio to work
# The strange invocation is required to suppress an error message on boot
bash -c 'alsactl restore || true' 2>&1 > /dev/null

Custom scatterplot colors with colorbar in Matplotlib

Published: 2018-01-24 1 minute read

I needed to plot a standard scatterplot, but color the points according to some third dimension of data, and add a colorbar. This was surprisingly not documented anywhere I could find.

import matplotlib
import matplotlib.pyplot as plt
import random

# We have three dimensions of data. x and y will be plotted on the x and y axis, while z will 
# be represented with color.
# If z is a numpy array, matplotlib refuses to plot this.
x = list(range(300))
y = sorted([random.random()*10 for i in range(300)])
z = list(reversed([i**3 for i in y]))

# cmap will generate a tuple of RGBA values for a given number in the range 0.0 to 1.0 
# (also 0 to 255 - not used in this example).
# To map our z values cleanly to this range, we create a Normalize object.
cmap ='viridis')
normalize = matplotlib.colors.Normalize(vmin=min(z), vmax=max(z))
colors = [cmap(normalize(value)) for value in z]

fig, ax = plt.subplots(figsize=(10,10))
ax.scatter(x, y, color=colors)

# Optionally add a colorbar
cax, _ = matplotlib.colorbar.make_axes(ax)
cbar = matplotlib.colorbar.ColorbarBase(cax, cmap=cmap, norm=normalize)


Using compound data types in h5py

Published: 2017-12-31 2 minute read

Compound data types allow you to create Numpy arrays of heterogeneous data types and store them in HDF5. For the example in this blog post, we want to store X and Y coordinates (as unsigned 32-bit integers), an intensity value (as a float), and a DNA sequence (as a variable-length string).

import numpy as np
import h5py

my_datatype = np.dtype([('x', np.uint32),
                        ('y', np.uint32),
                        ('intensity', np.float),
                        ('sequence', h5py.special_dtype(vlen=str))])
h5 = h5py.File("my_data.h5", "w")
data = [(1003, 4321, 43.2, "ACGTACTG"), (55, 4098, 12.1, "GGT"), (3209, 909, 59.7, "ACTAC")]
dataset = h5.create_dataset('/coordinates', (len(data),), dtype=my_datatype)
data_array = np.array(data, dtype=my_datatype)

We can look at data_array and see how this datatype looks:

In [1]: data_array
array([(1003, 4321,  43.2, 'ACGTACTG'), (  55, 4098,  12.1, 'GGT'),
       (3209,  909,  59.7, 'ACTAC')],
      dtype=[('x', '<u4'), ('y', '<u4'), ('intensity', '<f8'), ('sequence', 'O')])

To save the data efficiently, just assign it to the dataset like so:

dataset[...] = data_array

Regular indexing works:

In [2]: h5['/coordinates'][0]
Out[2]: (1003, 4321,  43.2, 'ACGTACTG')

Selection syntax works basically as you'd expect. Optionally, you can tack on an extra index with fields in it to limit the fields that are returned:

selector = (h5['coordinates']['x'] > 1000) & (h5['coordinates']['intensity'] > 40.0)
fields = ['x', 'y']
coords = h5['/coordinates'][selector][fields]

coords is now a compound datatype consisting of just two 32-bit unsigned integers:

In [3]: coords
array([(1003, 4321), (3209,  909)],
      dtype=[('x', '<u4'), ('y', '<u4')])

Even though coords is homogenous, many numpy operations can't be performed on it. This might not be the best way to do this, but you can get a regular array with:

In [4]: raw_coordinates = np.stack((coords['x'], coords['y']), axis=-1)
In [5]: raw_coordinates
array([[1003, 4321],
       [3209,  909]], dtype=uint32)