Over the past year, one of my goals was to learn a legitimate programming language. During my freshman year of college I had taken elementary MATLAB and C and had briefly opened the VBA editor in Excel, so while I was at one point well versed in the basics, I had quite a bit of catching up to do if I wanted to deploy something complex into my workflow. To start, I needed to pick a language.
I eventually settled on Python. There were a few reasons for this.
- Abundance of Education
  - Codecademy.com is where I learned the basics, with the capstone being file I/O. This is extremely important in a data analysis environment because nearly all of my data is in a tabular format.
  - The documentation is fantastic and usually provides clear, concise examples.
- Simple Syntax
  - I was familiar with static typing from my C days, so trying something dynamically typed interested me in two distinct ways:
    - Developer time doesn't need to be spent declaring variable types.
    - It is more tolerant of mistakes that a person new to writing production code would make (e.g. integer division).
- Speed
  - I had previously written a lot of my workflow in VBA. If one has ~50,000 lines of data in Excel, that is pushing the limit. Anything beyond 70,000 is too slow (and often a memory hog anyway). 100,000 lines is just plain impossible without a time machine. I needed something faster.
  - Cython. I haven't delved into this, but the ability to get my feet wet in a dynamically typed environment (Python) with the option for massive speed gains should I statically type variables (Cython) intrigued me.
- Libraries
  - Pandas is magical for wrangling large data sets.
  - NumPy, which Pandas is built on, provides some very nice data structures.
  - Unfortunately, from what my investigations into R suggest, Python doesn't have as many statistical functions as R does, but the list seems to be growing.
  - I'd like to dig into Django at some point. Having all the data in the world is nice, but it needs to be presented in an easily accessible form.
  - IPython, while not a library, is certainly a nifty editor/console. It makes rapid development and export quick and easy.
  - Also not a library, but the Anaconda distribution of Python is a wonderful bundle of many of the useful libraries. It makes it possible to grab everything you might be interested in at once.
With those reasons in mind, Ruby fits many of the criteria, but not all. Ruby also has attributes that Python may not offer. For my purposes, though, the Python ecosystem seemed to fit better.
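As a rough illustration of the tabular file I/O mentioned in the list above, here is a minimal sketch that uses only the standard library's csv module. The column names ("sample", "value") and the data are made up for the example, and io.StringIO stands in for a real file so the snippet runs on its own.

```python
import csv
import io

# Hypothetical tabular data; a real script would use open("some_file.csv", newline="")
RAW = "sample,value\na,1.5\nb,2.5\nc,5.0\n"


def read_rows(handle):
    """Read CSV rows into a list of dicts, converting the 'value' column to float."""
    rows = []
    for row in csv.DictReader(handle):
        row["value"] = float(row["value"])
        rows.append(row)
    return rows


def mean_value(rows):
    """Average the 'value' column across all rows."""
    return sum(r["value"] for r in rows) / len(rows)


rows = read_rows(io.StringIO(RAW))
print(len(rows), mean_value(rows))
```

Reading each row into a dict keyed by the header keeps the code readable when the column order changes, at the cost of some speed compared with plain lists.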
One of my first major issues was learning that Python is very pointer intensive: variables are just names bound to objects, and assigning or appending an object never copies it. I was struggling to understand why some portions of code I had written kept failing. Dictionaries (Python's implementation of a hash table) nested in lists were seemingly untouched despite having their contents modified within a loop. I then discovered that I needed to use the "copy.deepcopy()" function. This is one of the downsides of a high-level language. Much of the "plumbing" is kept under the floor, which has pros and cons. Such an event wouldn't have happened in C without my explicit permission.
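I no longer have the exact code that bit me, so the following is only a reconstruction of the general pitfall, with hypothetical names and data. It contrasts a list that repeats one dict, a list of shallow copies, and a list of deep copies.

```python
import copy


def make_template():
    """Hypothetical record layout: a dict holding a nested list."""
    return {"name": None, "readings": []}


# 1. Repeating one dict: all three slots refer to the same object.
aliased = [make_template()] * 3
aliased[0]["readings"].append(1.0)

# 2. Shallow copies: the outer dicts differ, but they share the inner list.
template = make_template()
shallow = [copy.copy(template) for _ in range(3)]
shallow[0]["name"] = "first"          # only affects shallow[0]
shallow[0]["readings"].append(1.0)    # visible through every copy

# 3. Deep copies: nested objects are duplicated too, so records are independent.
template = make_template()
deep = [copy.deepcopy(template) for _ in range(3)]
deep[0]["readings"].append(1.0)

print(aliased[1]["readings"])   # [1.0]
print(shallow[1]["readings"])   # [1.0]
print(shallow[1]["name"])       # None
print(deep[1]["readings"])      # []
```

Only the deep copies behave like independent records. The trade-off is that copy.deepcopy() walks the whole nested structure, so it is slower than a shallow copy on large objects.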
After resolving the deepcopy issue and re-coding some of my VBA work (VBA being something I had learned on the job) in Python, I found massive speed gains (~1000x), which surprised me quite a bit. Multiprocessing, which I've used on other projects, would improve this even further. Needless to say, I was hooked, and I barely delve into VBA unless I must. The ability to use a text editor of my choice is also quite huge. The VBA editor is a bit... clunky.
Pandas has also been a boon to productivity. The DataFrame is an incredible object. It truly is a light-speed stand-in for my trusty Excel spreadsheet. I'm working my way through Wes McKinney's "Python for Data Analysis", which elaborates on much of the online documentation for Pandas, which naturally revolves around the DataFrame. Pandas + NumPy + SciPy seems to be a wicked combination for wrangling data to get quick and accurate results.
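To give a flavor of the spreadsheet comparison, here is a small sketch that assumes pandas is installed. The column names and values are invented for illustration. It shows three things I would otherwise do in Excel: filling a formula down a column, building a pivot-style summary, and filtering rows.

```python
import pandas as pd

# Made-up measurements; in a spreadsheet this would be three columns
df = pd.DataFrame(
    {
        "site": ["north", "north", "south", "south"],
        "depth_m": [10.0, 20.0, 10.0, 20.0],
        "reading": [1.0, 3.0, 2.0, 6.0],
    }
)

# Derived column, like filling a formula down a spreadsheet column
df["reading_per_m"] = df["reading"] / df["depth_m"]

# Pivot-table-style summary: mean reading per site
mean_by_site = df.groupby("site")["reading"].mean()

# Row filter, like an Excel autofilter on the depth column
deep_rows = df[df["depth_m"] > 15]

print(mean_by_site)
print(deep_rows)
```

Each operation works on whole columns at once, which is a large part of why this style tends to scale better than looping over cells in VBA.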
Overall, I couldn't be more pleased with my current language of choice, and I look forward to other things I can learn with the new tools on my belt!