Ever since I was a teenager, and all through my university years, I knew that I didn’t want to end up like my parents, working a 9-to-5 job, punching my timecard like a schmo and being bossed around by some dumb boss. My mum brought me up as a communist, so I knew that it was “because of Capitalism” that pretty much all jobs are soulless and dehumanised (I still believe this). But growing up in the neo-liberalism of the 90s and 2000s, I couldn’t imagine that things could ever be any different. I read David Graeber’s books a few years ago and got a little interested in Anarchism as an alternative way of working in teams. But the idea of the endless meetings and consensus processes sounded dreadful and was a big turnoff.
One of the major limitations of scikit-learn is its lack of native support for categorical variables in tree models. One way around this is to use pd.get_dummies to one-hot encode the categorical variables, but this doesn’t work well when a column has a lot of levels. Instead, it can actually be more effective to convert the column to a numeric id and treat it as a continuous variable: as long as your trees have enough depth, they will re-segment the variable into its natural categories. (I have routinely achieved up to 5% improvement in AUROC using this technique compared to a model of the same size/complexity using dummy variables.) The categorical data type, available in Pandas since version 0.15, makes this a breeze: the column is stored as a vector of NumPy integer codes, together with a list of levels which is used as a codebook.
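Here is a minimal sketch of the two encodings side by side. The toy DataFrame and the column names (city, clicks, city_id) are made up for illustration; the only Pandas features assumed are astype("category"), the .cat accessor and pd.get_dummies.

```python
import pandas as pd

# Toy data: "city" is the categorical column (hypothetical example).
df = pd.DataFrame({
    "city": ["Sydney", "Melbourne", "Sydney", "Perth"],
    "clicks": [3, 1, 4, 1],
})

# Option 1: one-hot encoding, which adds one column per level.
dummies = pd.get_dummies(df["city"], prefix="city")

# Option 2: convert to the categorical dtype and use the integer codes
# as a single numeric feature for a tree model.
df["city"] = df["city"].astype("category")
df["city_id"] = df["city"].cat.codes

# The list of levels acts as the codebook: code -> original label.
codebook = dict(enumerate(df["city"].cat.categories))

print(dummies.shape)
print(df[["city", "city_id"]])
print(codebook)
```

By default the levels are sorted, so the ids follow alphabetical order rather than anything meaningful; the trees have to do the work of splitting them back into useful groups.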
One of my biggest pet peeves with Pandas is how hard it is to create a panel of bar charts grouped by another variable. I know that this would be nontrivial in Excel too (I guess you’d have to manually create separate charts from a pivot table), but the problem is that I’ve always been taunted by the by parameter in histogram, which I never get to use since 98% of the time I’m dealing with categorical variables instead of numerical variables. And apparently categorical data have bar charts, not histograms, which according to some sticklers are somehow not the same thing (I insist they are!).
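One possible way to get such a panel, sketched under my own assumptions rather than as the definitive approach: count the levels with pd.crosstab, then let DataFrame.plot draw one bar chart per group with subplots=True. The data, the column names (region, colour) and the helper plot_panel are all hypothetical, and the plotting step needs matplotlib installed.

```python
import pandas as pd

# Toy data: count the levels of "colour" separately for each "region".
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "colour": ["red", "blue", "red", "red", "blue"],
})

# Rows are the levels of the categorical variable,
# columns are the groups (one panel per column).
counts = pd.crosstab(df["colour"], df["region"])


def plot_panel(counts):
    """Draw one bar chart per column of the crosstab (requires matplotlib)."""
    return counts.plot(
        kind="bar",
        subplots=True,                      # one axes per group
        layout=(1, counts.shape[1]),        # a single row of panels
        sharey=True,                        # same y-scale for comparison
        legend=False,
    )


print(counts)
```

Calling plot_panel(counts) in a notebook should give a row of bar charts, one per region, which is roughly what the by parameter does for numerical data.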
A friend from uni is just finishing his PhD and got in touch with me asking for advice about finding a non-academic job, and about the difference between academia and industry. I really enjoyed the chance to reflect on my past couple of years, and I thought I might share the advice I gave him in case it’s useful to other people.
At work everyone’s starting to catch the functional programming bug (starting with F#, but now with people trying out Clojure and Scala) and I don’t want to be left behind. Of course, Python isn’t the best choice for functional programming (no tail-call optimisation etc.) and I’m starting to play with Clojure (with Flambo for Spark), but in the meantime I’ve been wanting to try and wrap my head around the “no side effects” concept of functional programming in my existing Pandas code.
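As a rough illustration of what “no side effects” can look like in Pandas, here is a small sketch: each step is a function that takes a DataFrame and returns a new one, and the steps are chained with pipe. The function names (add_total, keep_large) and the data are invented for this example.

```python
import pandas as pd


def add_total(df):
    """Return a new DataFrame with a "total" column; the input is untouched."""
    return df.assign(total=df["price"] * df["qty"])


def keep_large(df, threshold):
    """Return only the rows whose total is at least `threshold`."""
    return df[df["total"] >= threshold]


orders = pd.DataFrame({"price": [2.0, 5.0, 1.0], "qty": [3, 1, 10]})

# Chain the steps; `orders` itself is never modified in place.
result = orders.pipe(add_total).pipe(keep_large, threshold=6.0)

print(result)
print(list(orders.columns))
```

The point of assign over something like df["total"] = ... is that the original frame stays exactly as it was, so each step can be tested and reused on its own.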
Data pipelines are a good way to deploy a simple data processing task which needs to run on a daily or weekly schedule; a pipeline will automatically provision an EMR cluster for you, run your script, and then shut the cluster down at the end. If the pipeline is more complex, it might be worth using something like Airbnb’s Airflow, which has recently been open-sourced and is definitely on my list of things to check out.
A coworker recently asked for my help getting a Python 3 IPython Notebook server running on an Amazon EC2 instance. It was pretty straightforward in the end, but it involved a bit of trial and error, so I thought I’d collect the steps here just in case someone else (or myself) needs to do this again. I tried this on the primary node of an EMR cluster, AMI version 3.8.0; try at your own peril on any other AMI…
One of my coworkers was recently working on a project that required lemmatizing a large number of documents using NLTK in Python. His solution (running locally on his laptop) was taking 20 hours to run, and being obsessed with Hive at the time, naturally I wanted to see if I could save him some time by implementing it as a custom UDF. In the end it was a pretty painful process, but I learned a lot, and I feel a lot more confident trying this kind of thing in the future. I also learned a neat trick for debugging UDFs which I’m going to share!