Book Review, Frederic Laloux, Reinventing Organizations

Ever since I was a teenager, and all through my university years, I knew that I didn’t want to end up like my parents, working a 9-to-5 job, punching my timecard like a schmo and being bossed around by some dumb boss. My mum brought me up as a communist, so I knew that it was “because of Capitalism” that pretty much all jobs are soulless and dehumanised (I still believe this). But growing up in the neo-liberalism of the 90s and 2000s, I couldn’t imagine that things could ever be any different. I read David Graeber’s books a few years ago and got a little interested in Anarchism as an alternative way of working in teams. But the idea of endless meetings and consensus processes sounded dreadful and was a big turnoff.

»

Productionising a model using Pandas Categorical variables.

One of the major limitations of scikit-learn is its lack of native support for categorical variables in tree models. One way around this is to use pd.get_dummies to one-hot encode the categorical variables, but this doesn’t work well when the column has a lot of levels. Instead, it can actually be more effective to convert the column to a numeric id and treat it as a continuous variable – as long as your trees have enough depth, they will naturally re-segment the variable into its natural categories. (I have routinely achieved up to 5% improvement in AUROC using this technique compared to a model of the same size/complexity using dummy variables.) The categorical data type, available in Pandas since version 0.15, makes this a breeze – the column is stored as a vector of NumPy integer codes, together with a list of levels which is used as a codebook.
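As a rough illustration of the idea, here is a minimal sketch of turning a categorical column into integer codes and reusing the same codebook at scoring time. The data, the column names (`city`, `city_id`) and the variable names are made up for the example, and this is only one possible way to do it, not necessarily the approach the full post takes.

```python
import pandas as pd

# Hypothetical training data; column names and values are illustrative only.
train = pd.DataFrame({"city": ["Sydney", "Perth", "Sydney", "Hobart"]})

# Convert to the categorical dtype; pandas builds the list of levels (the codebook).
train["city"] = train["city"].astype("category")
levels = train["city"].cat.categories  # keep this alongside the trained model

# The integer codes can be fed to a tree model as an ordinary numeric column.
train["city_id"] = train["city"].cat.codes

# At scoring time, reuse the saved levels so each value maps to the same code.
new = pd.DataFrame({"city": ["Perth", "Darwin"]})
new["city_id"] = pd.Categorical(new["city"], categories=levels).codes
# Values missing from the codebook (here "Darwin") receive the code -1.
```

Passing the saved levels explicitly is what keeps the mapping stable in production: if the codes were rebuilt from each new batch, the same city could end up with a different id from one run to the next.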

»

Grouped "histograms" for categorical data in Pandas

One of my biggest pet peeves with Pandas is how hard it is to create a panel of bar charts grouped by another variable. I know that this would be nontrivial in Excel too (I guess you’d have to manually create separate charts from a pivot table), but the problem is that I’ve always been taunted by the by parameter of the histogram method, which I never get to use since 98% of the time I’m dealing with categorical variables instead of numerical variables. And apparently categorical data have bar charts, not histograms, which according to some sticklers are somehow not the same thing (I insist they are!).
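One way to get something like a grouped "histogram" for categorical data is to count the levels first and then plot the counts. The sketch below assumes a toy DataFrame with made-up columns `colour` and `shop`; the helper names `grouped_counts` and `plot_grouped_bars` are hypothetical, and the full post may well take a different route.

```python
import pandas as pd


def grouped_counts(df, column, by):
    """Count each level of `column` within each level of `by`."""
    return pd.crosstab(df[column], df[by])


def plot_grouped_bars(df, column, by):
    """Draw one bar chart per level of `by` (needs matplotlib installed)."""
    counts = grouped_counts(df, column, by)
    # subplots=True gives each column of the count table its own panel.
    return counts.plot(kind="bar", subplots=True, sharey=True, legend=False)


# Illustrative data only.
df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "blue", "red"],
    "shop": ["A", "A", "B", "B", "B", "A"],
})
counts = grouped_counts(df, "colour", "shop")
```

Calling `plot_grouped_bars(df, "colour", "shop")` would then draw one panel per shop, with a bar for every colour, including colours that have a zero count in that shop.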

»

Advice from a recovering academic

A friend from uni is just finishing his PhD and got in touch with me asking for advice about finding a non-academic job, and about the difference between academia and industry. I really enjoyed the chance to reflect on my past couple of years, and I thought I might share the advice I gave him in case it’s useful to other people.

»

Functional pipelines in Pandas

At work everyone’s starting to catch the functional programming bug (starting with F#, but now with people trying out Clojure and Scala) and I don’t want to be left behind. Well, of course Python isn’t the best choice for functional programming (no tail-call optimisation etc.) and I’m starting to play with Clojure (with Flambo for Spark), but in the meantime I’ve been wanting to try and wrap my head around the “no side effects” concept of functional programming in my existing Pandas code.
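To make the “no side effects” idea concrete, here is a small sketch in which every step is a function that takes a DataFrame and returns a new one, chained with `DataFrame.pipe`. The step functions and the columns (`item`, `price`, `quantity`, `total`) are invented for the example and are not taken from the full post.

```python
import pandas as pd


def drop_missing_prices(df):
    # dropna returns a new DataFrame and leaves the input untouched.
    return df.dropna(subset=["price"])


def add_total(df):
    # assign returns a copy with the extra column rather than mutating df.
    return df.assign(total=lambda d: d["price"] * d["quantity"])


def totals_by_item(df):
    return df.groupby("item", as_index=False)["total"].sum()


# Illustrative data only.
raw = pd.DataFrame({
    "item": ["pen", "pen", "ink", "ink"],
    "price": [2.0, 2.0, None, 5.0],
    "quantity": [1, 3, 2, 2],
})

# Each step receives the previous step's output; raw itself is never modified.
result = raw.pipe(drop_missing_prices).pipe(add_total).pipe(totals_by_item)
```

Because none of the steps modifies its input, `raw` still has all four rows and no `total` column afterwards, so any step can be rerun or tested on its own.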

»

A Python script on AWS Data Pipeline

AWS Data Pipeline is a good way to deploy a simple data processing task which needs to run on a daily or weekly schedule: it will automatically provision an EMR cluster for you, run your script, and then shut the cluster down at the end. If the pipeline is more complex, it might be worth using something like Airbnb’s Airflow, which has recently been open-sourced and is definitely on my list of things to check out.

»

Installing a Python 3 Notebook server on EMR

A coworker recently asked for my help getting a Python 3 IPython Notebook server running on an Amazon EC2 instance. It was pretty straightforward in the end, but it involved a bit of trial and error, so I thought I’d collect the steps here just in case someone else (or myself) needs to do this again. I tried this on the primary node of an EMR cluster, AMI version 3.8.0; try at your own peril on any other AMI…

»

An NLTK Lemmatizer UDF in Hive

One of my coworkers was recently working on a project that required lemmatizing a large number of documents using NLTK in Python. His solution (running locally on his laptop) was taking 20 hours to run, and being obsessed with Hive at the time, naturally I wanted to see if I could save him some time by implementing it as a custom UDF. In the end it was a pretty painful process, but I learned a lot, and I feel a lot more confident trying this kind of thing in the future. I also learned a neat trick for debugging UDFs which I’m going to share!

»