Galvanize Data Science: Week 1

Wow, what a week!

If I thought Week 0 was tough, I was wrong. I haven’t worked this hard in a long time. It’s incredibly exciting, though, to be working and learning in a place like Galvanize. My classmates come from all walks of life: structural engineers, web developers, business analysts, even a snowboarding instructor. The week started off with a two-hour assessment on Python, NumPy, Pandas (not the bear), SQL, Calculus, Linear Algebra, Probability, and Statistics. I’m very glad that I took Week 0 because I know for a fact that if I hadn’t, the test would have been much harder. This test will serve as a baseline for our progress as we go through the program.

After the initial assessment, we went through all the nitty-gritty boilerplate at Galvanize. We got our keys, set up wifi, and learned the rules. Turns out that after the program ends in August, I get six months of access to all the facilities that Galvanize has: conference rooms, desks, the roof deck, social events, networking, etc. That’s an awesome bonus I was unaware of.

Going forward, the schedule more or less follows the table below.

Program Schedule

8:30 - 9 am Daily Quiz
9 - 11ish Morning Lecture
11ish - 12ish Individual Programming Assignment
12ish - 1:15 pm Lunch
1:15 - 3ish Afternoon Lecture
3ish - 5 pm (+/- 30 min) Pair Programming Assignment

EDIT: The reality seems to be that I leave Galvanize around 6pm or later.

We covered far too much information this week in lectures to go over on this blog, but here are the highlights.

Git + GitHub

One of the biggest focuses of this week has been getting familiar with Git and GitHub. These two tools are fast becoming the industry standard for version control. They allow scientists, engineers, hobbyists and the like to coordinate projects from all over the world without overwriting each other’s changes. In addition, if you were to, say, write a line of code that breaks everything, Git keeps a history of what are called “commits”. You can revert to a previous commit and get back to your working version. Git is the program that runs version control. GitHub is an online service, similar to Dropbox, that allows you to host your code and collaborate with others. Here’s a link to mine. There isn’t much there yet but it will be filling up fast.

SQL (it’s just a puzzle to get stuff)

In the era of big data, sometimes the biggest problem is just accessing the information you need and leaving the rest behind. SQL (Structured Query Language) is a language used across many industries to access data. Here’s a little example. Let’s say I have a table called “my_table” and it contains a “favorite_cheese” column.

SELECT * FROM my_table WHERE favorite_cheese='camembert';

This query would return every row in the table ‘my_table’ where the ‘favorite_cheese’ column is equal to ‘camembert’. It seems simple enough, but by being creative you can perform incredibly complex operations to retrieve exactly the results you are looking for.

We also covered Bash, Object-Oriented Programming, Pandas, AWS and more, but I’ll try to address those in future posts. The one thing I will mention is that if you type

ack --cathy

into your shell, you’ll see an ASCII version of Cathy saying “Chocolate, Chocolate, Chocolate, AACK!”. How useful is that?!

[Image: Screen Shot 2016-05-29 at 10.37.13 AM]

The week ended with a happy hour hosted specifically for Galvanize’s Data Science students past and present. We were able to meet students from the previous cohort and learn about their experiences during and after Galvanize. We’ve got a great group and I’m happy to be learning and working with these people.

Next Week: In Week 2, the focus will be on statistics and probability. Stay tuned!


Galvanize Data Science: Week 0


View of Pioneer Square from Galvanize’s headquarters in downtown Seattle.

In the data science program at Galvanize, you sign up for a 13-week intensive course in Python, machine learning, statistics and more. It is meant to be a highly efficient means of transitioning into the data science and analytics field, a transition I’ve been excited to make for some time now.

It turns out that Galvanize offers a voluntary Week 0, specifically focused on getting the members of the cohort up to snuff when it comes to Python programming and linear algebra.

I knew going into the program that Galvanize was going to be an intense academic challenge. Already on Day 1 of Week 0, I was having to work quite hard, thinking back to my undergraduate days when I worked with vector spaces and matrix algebra. Luckily, nothing has been too taxing as of yet.

I’ve been enjoying playing around with the Atom text editor, which is a very powerful and flexible way of writing in many different languages. In fact, I’m writing this entry in Markdown right now. One of my favorite things about it is that I can use LaTeX math notation right in the editor, meaning I can write out complex equations, arrays and the like quite easily.

The location and setting of Galvanize are both quite awesome. It is located in the heart of Seattle’s Pioneer Square in a renovated brick building (which apparently used to be NBBJ’s headquarters). Housed in the building, in addition to Galvanize’s education programs, are many startups, making the atmosphere busy and exciting. Because this week is voluntary, only part of my future cohort is here, but so far everyone, including the teachers, seems very intelligent, motivated, and friendly.

I’m excited for the next 14-ish weeks of my life and the challenges and opportunities that this fellowship will bring me. My plan is to write a blog entry for each week of the program so people can track my progress, and see what a programming bootcamp is really like.


Analyzing the relationship between retail pot sales and call-center data

For years, the criminalization of marijuana sale and usage has made data collection and research on the topic difficult to perform. In Washington State, recreational marijuana went on sale in local dispensaries starting in mid-2014. Whether or not the opening of a dispensary produces a spike in the amount and type of marijuana use is a valid question for legislators, administrators, doctors and more.

As an exploratory exercise, I created the following map using call-center data gathered by the Washington State Poison Center on marijuana use and data scraped from the web on the location and opening date of retail marijuana shops in Washington State. Data ranges from January 2014 to August 2015. Both calls and shops are localized by zip code. By scrolling through we can see where and when shops and cases cropped up.

“Cases” are any calls that went to the Washington Poison Center related to marijuana usage. This could be anything from “My child got into my weed cookies” to a doctor calling to consult on someone who ingested too much marijuana.

[Map: weed_actual]

In this period of time there were only a few hundred cases. This was enough, however, to see some trends in the data. The highest number of cases occurred in the U District and in Pioneer Square.

Please note that currently only shops in KING COUNTY are shown.

This map was created using R, Leaflet, and Shiny.


[R] A little bit on multidimensional arrays and apply()

The command line can be a little unintuitive when dealing with multidimensional objects since it is a 2D medium. It is therefore hard to envision objects with more than two dimensions. They exist, however!

An array in R is simply a vector (an ordered sequence of values) that carries an additional “dim” attribute. In other words, each vector element is given a position along each dimension. This is fairly easy to represent 3-dimensionally (see below), but there is no reason why further dimensions cannot be added, placing the elements in the 4th, 5th…nth dimensions.

Using array(), I created a 3-dimensional array object (represented by that box with numbers you see below) populated with values 1 to 4. Each value is given a position along each dimension: the 1’s are located at [1,1,1] and [1,2,1], the 4’s are located at [2,1,2] and [2,2,2], and so on.

Here is the array function:

array(data, dimensions,...)

[Figure: 3Darray_apply_1]

The first argument of array() is the actual data to be used. The second argument is the dimensions (named dim in R), an integer vector giving the extent of the array along each dimension; for the example above, this is 2 by 2 by 2.
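
As a minimal sketch, an array like the one described could be built as follows. The data vector here is my assumption: it is chosen so that the 1’s and 4’s land at the positions given above, while the placement of the 2’s and 3’s is a guess. The names a and b are just for illustration.

```r
# Assumed data vector. R fills arrays with the first index varying fastest,
# so this puts the 1's at [1,1,1] and [1,2,1] and the 4's at [2,1,2] and
# [2,2,2]. Where the 2's and 3's go is a guess.
a <- array(c(1, 2, 1, 2, 3, 4, 3, 4), dim = c(2, 2, 2))

dim(a)        # extent of the array along each dimension
a[1, 1, 1]    # element in row 1, column 1, layer 1
a[2, 2, 2]    # element in row 2, column 2, layer 2
a[, , 1]      # the whole first layer, as a 2 x 2 matrix

# The same idea extends past three dimensions: a 2 x 2 x 2 x 2 array.
b <- array(1:16, dim = c(2, 2, 2, 2))
length(dim(b))  # number of dimensions
```

Leaving an index blank, as in a[, , 1], keeps everything along that dimension, which is a handy way to look at one 2D slice at a time on the command line.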

Using apply(), we can apply a function, in this case sum(), to elements which are aligned in certain directions. The apply() function takes the following arguments:

apply(X, margins, FUN)

where X is the array over which apply should be…applied, margins is an integer vector telling R which margins (dimensions) to maintain, with the rest collapsed, and FUN is the function to be applied. Basically, the apply() function is taking the sum over all elements along a certain edge of the cube. The margins argument simply tells R which dimensions to keep, and therefore which edges we are summing over. In the examples below, R converts a 3D array object into a 2D object. You can see the effect of changing the margins argument on the summed arrays shown below.

[Figure: 3Darray_apply_2_4]
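
To make the effect of the margins concrete, here is a small sketch using sum() on a 2 by 2 by 2 array. The data vector is an assumption (it matches the positions of the 1’s and 4’s described earlier, with the 2’s and 3’s placed by guesswork), and the variable names are only illustrative, so the numbers may not match the figures exactly.

```r
# Assumed example array: 2 rows, 2 columns, 2 layers.
a <- array(c(1, 2, 1, 2, 3, 4, 3, 4), dim = c(2, 2, 2))

# Keep dimensions 1 and 2 (rows and columns), collapse dimension 3 (layers):
# each cell of the result is the sum of the two values stacked behind it.
sum_over_layers <- apply(a, c(1, 2), sum)

# Keep dimensions 1 and 3, collapse dimension 2 (columns).
sum_over_columns <- apply(a, c(1, 3), sum)

# Keep only dimension 3: one total per layer, so the result is a plain vector.
sum_per_layer <- apply(a, 3, sum)

print(sum_over_layers)
print(sum_over_columns)
print(sum_per_layer)
```

Keeping two margins gives a 2D result, as in the figures, while keeping a single margin collapses the cube all the way down to a vector.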