I currently work as a data scientist at NYC Data Science Academy. Last night, we had a fantastic open house party to introduce our 12-Week Data Science bootcamp.
It was a good chance to learn what data science is, find out what you will learn in the bootcamp, and make friends. For me, it was also a good chance to enjoy free food and drinks! During the event, a question suddenly occurred to me:
What is the most popular drink brand at the open house party?
To answer this question, I followed three steps. They also serve as a short introduction to two R packages, ggplot2 and dplyr.
Step 1: Collect data
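# stringsAsFactors = TRUE keeps Brand a factor (the default changed to FALSE in R 4.0)
Data = read.csv("OpenHouseParty.csv", header = TRUE, stringsAsFactors = TRUE)
Data$Brand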
[1] Samuel Adams Beer Angry Orchard Corona Extra
[4] Angry Orchard Samuel Adams Beer Angry Orchard
[7] Angry Orchard Angry Orchard Angry Orchard
[10] Corona Extra Corona Extra Blue Moon
[13] Blue Moon Angry Orchard Angry Orchard
[16] Angry Orchard
4 Levels: Angry Orchard Blue Moon … Samuel Adams Beer
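For readers who do not have OpenHouseParty.csv, here is a minimal sketch that rebuilds the same 16 observations by hand, copied from the printed output above. The single-column layout with a column named Brand is an assumption inferred from the later steps; the actual file may contain more.

```r
# Rebuild the 16 recorded drinks from the printed output above.
# Assumption: the CSV has a single column named Brand.
brands <- c(
  "Samuel Adams Beer", "Angry Orchard", "Corona Extra",
  "Angry Orchard", "Samuel Adams Beer", "Angry Orchard",
  "Angry Orchard", "Angry Orchard", "Angry Orchard",
  "Corona Extra", "Corona Extra", "Blue Moon",
  "Blue Moon", "Angry Orchard", "Angry Orchard",
  "Angry Orchard"
)
Data <- data.frame(Brand = factor(brands))

nrow(Data)          # number of drinks recorded
levels(Data$Brand)  # distinct brands, in alphabetical order
```

With Data defined this way, Steps 2 and 3 can be run without the CSV file.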
Step 2: Group data by Brand
library(dplyr)
NewData = Data %>%
  group_by(Brand) %>%
  summarise(Count = n())
NewData
Source: local data frame [4 x 2]

              Brand Count
             (fctr) (int)
1     Angry Orchard     9
2         Blue Moon     2
3      Corona Extra     3
4 Samuel Adams Beer     2
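As a quick cross-check of these counts, here is a sketch of two alternatives to group_by() plus summarise(): dplyr's count() with sort = TRUE, which puts the most popular brand first, and base R's table(). The data frame is rebuilt inline from the counts shown above so the snippet runs on its own; the names Ranked and BaseCounts are only illustrative.

```r
library(dplyr)

# Rebuild the data from the counts in the summary table above.
Data <- data.frame(
  Brand = factor(rep(
    c("Angry Orchard", "Blue Moon", "Corona Extra", "Samuel Adams Beer"),
    times = c(9, 2, 3, 2)
  ))
)

# count() groups by Brand and tallies rows; sort = TRUE orders by descending count.
Ranked <- Data %>% count(Brand, sort = TRUE, name = "Count")
Ranked

# Base R equivalent: a frequency table sorted from most to least common.
BaseCounts <- sort(table(Data$Brand), decreasing = TRUE)
BaseCounts
```

Either form answers the original question directly, because the first row (or first table entry) is the most popular brand.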
Step 3: Create a bar chart
library(ggplot2)
ggplot(NewData, aes(x = Brand, y = Count, fill = Brand)) + geom_bar(stat = "identity")
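The chart above orders the bars alphabetically by brand. If you would rather see the most popular brand first, one possible variation is sketched here. It uses reorder() to sort the x axis by count and geom_col(), which is equivalent to geom_bar(stat = "identity"). The summary table is entered by hand from Step 2 so the snippet is self-contained, and the variable name p is only illustrative.

```r
library(ggplot2)

# Summary table from Step 2, entered by hand so the snippet runs on its own.
NewData <- data.frame(
  Brand = c("Angry Orchard", "Blue Moon", "Corona Extra", "Samuel Adams Beer"),
  Count = c(9, 2, 3, 2)
)

# reorder(Brand, -Count) sorts the x axis from largest to smallest count.
# geom_col() is shorthand for geom_bar(stat = "identity").
p <- ggplot(NewData, aes(x = reorder(Brand, -Count), y = Count, fill = Brand)) +
  geom_col() +
  labs(x = "Brand", y = "Count")
p
```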
So my conclusion is that I would strongly recommend that my coworkers order more Angry Orchard Hard Cider for the next open house event, even though my sample size is relatively small.