AS3_1: Digital Detective

The file data/mvtWeek1.csv contains records of motor vehicle thefts in the city of Chicago. Below are the definitions of the columns …

ID: a unique identifier for each observation
Date: the date the crime occurred
LocationDescription: the location where the crime occurred
Arrest: whether or not an arrest was made for the crime (TRUE if an arrest was made, and FALSE if an arrest was not made)
Domestic: whether or not the crime was a domestic crime, meaning that it was committed against a family member (TRUE if it was domestic, and FALSE if it was not domestic)
Beat: the area, or “beat”, in which the crime occurred. This is the smallest regional division defined by the Chicago Police Department.
District: the police district in which the crime occurred. Each district is composed of many beats, and the districts are defined by the Chicago Police Department.
CommunityArea: the community area in which the crime occurred. Since the 1920s, Chicago has been divided into what are called “community areas”, of which there are now 77. The community areas were devised in an attempt to create socially homogeneous regions.
Year: the year in which the crime occurred.
Latitude: the latitude of the location at which the crime occurred.
Longitude: the longitude of the location at which the crime occurred.

The first question is, and should always be, what can we do with this data?

While we conduct the practical analysis, we’ll also …

learn the Date type and the standard date-time data type POSIXct
practice the two most useful functions in analysis: table and tapply

Section-1 Loading the Data

Read the data into a data frame D.

D = read.csv("data/mvtWeek1.csv")

As I said, the shorter the name of your main data frame, the better.

【1.1】How many rows of data (observations) are in this dataset?

nrow(D)

[1] 191641

【1.2】How many variables are in this dataset?

ncol(D)

[1] 11

Let’s also check the data types of these variables

str(D)

'data.frame':   191641 obs. of  11 variables:
 $ ID                 : int  8951354 8951141 8952745 8952223 8951608 8950793 8950760 8951611 8951802 8950706 ...
 $ Date               : chr  "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" ...
 $ LocationDescription: chr  "STREET" "STREET" "RESIDENTIAL YARD (FRONT/BACK)" "STREET" ...
 $ Arrest             : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ Domestic           : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Beat               : int  623 1213 1622 724 211 2521 423 231 1021 1215 ...
 $ District           : int  6 12 16 7 2 25 4 2 10 12 ...
 $ CommunityArea      : int  69 24 11 67 35 19 48 40 29 24 ...
 $ Year               : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ Latitude           : num  41.8 41.9 42 41.8 41.8 ...
 $ Longitude          : num  -87.6 -87.7 -87.8 -87.7 -87.6 ...

and do a quick summary on them.

summary(D)

       ID              Date           LocationDescription   Arrest         Domestic      
 Min.   :1310022   Length:191641      Length:191641       Mode :logical   Mode :logical  
 1st Qu.:2832144   Class :character   Class :character    FALSE:176105    FALSE:191226   
 Median :4762956   Mode  :character   Mode  :character    TRUE :15536     TRUE :415      
 Mean   :4968629                                                                         
 3rd Qu.:7201878                                                                         
 Max.   :9181151                                                                         
                                                                                         
      Beat         District     CommunityArea        Year         Latitude   
 Min.   : 111   Min.   : 1      Min.   : 0      Min.   :2001   Min.   :41.6  
 1st Qu.: 722   1st Qu.: 6      1st Qu.:22      1st Qu.:2003   1st Qu.:41.8  
 Median :1121   Median :10      Median :32      Median :2006   Median :41.9  
 Mean   :1259   Mean   :12      Mean   :38      Mean   :2006   Mean   :41.8  
 3rd Qu.:1733   3rd Qu.:17      3rd Qu.:60      3rd Qu.:2009   3rd Qu.:41.9  
 Max.   :2535   Max.   :31      Max.   :77      Max.   :2012   Max.   :42.0  
                NA's   :43056   NA's   :24616                  NA's   :2276  
   Longitude    
 Min.   :-87.9  
 1st Qu.:-87.7  
 Median :-87.7  
 Mean   :-87.7  
 3rd Qu.:-87.6  
 Max.   :-87.5  
 NA's   :2276

【1.3】Using the “max” function, what is the maximum value of the variable ID?

【1.4】 What is the minimum value of the variable Beat?

【1.5】 How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?

sum(D$Arrest)

[1] 15536

What is the average arrest rate of car theft events?
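Because R treats TRUE as 1 and FALSE as 0, sum() on a logical vector counts the TRUE values and mean() gives their share. A minimal sketch on a made-up vector (the name arrest and its values are illustrative only and are not taken from D):

```r
# Toy logical vector standing in for D$Arrest (values are invented)
arrest <- c(FALSE, TRUE, FALSE, FALSE, TRUE)

n_arrests   <- sum(arrest)    # TRUE counts as 1, FALSE as 0
arrest_rate <- mean(arrest)   # share of TRUE values

n_arrests
arrest_rate
```

Applied to the real data, mean(D$Arrest) would correspond to 15536 / 191641, roughly 8.1%.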

【1.6】 How many observations have a LocationDescription value of ALLEY?

Section-2 Processing Date and Time in R

【2.1】 In what format are the entries in the variable Date?

Month/Day/Year Hour:Minute
Day/Month/Year Hour:Minute
Hour:Minute Month/Day/Year
Hour:Minute Day/Month/Year

head(D$Date)  # Month/Day/Year Hour:Minute

[1] "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 21:30"
[6] "12/31/12 20:30"

🌞 Dates and times are quite troublesome. In principle the timeline is a simple linear dimension, like the real numbers. In human society, however, these nice real-number properties get messed up: months have different numbers of days, and a week may span two months. Worst of all, people record dates and times in so many different formats that computers cannot reliably read them automatically. Dates and times are therefore usually read in as character strings. We have to look at the strings, recognize the format, and then convert them into the Date or POSIXct (the name of a standard date-time system) data type before we can properly use them in our programs. For example, here we observe the format of D$Date and convert this chr column into a POSIXct vector (ts).

ts = as.POSIXct(D$Date, format="%m/%d/%y %H:%M")

🌻 as.POSIXct(x, format) converts a string vector x into a POSIXct vector.
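A small sketch of what happens when the format string does or does not match the data. The strings below are made up in the same layout as D$Date, and tz = "UTC" is an assumption added here so the result does not depend on the local time zone:

```r
# Made-up strings in the same layout as D$Date
x <- c("12/31/12 23:15", "01/02/13 08:05")

# Matching pattern: month/day/2-digit year hour:minute
ok  <- as.POSIXct(x, format = "%m/%d/%y %H:%M", tz = "UTC")

# Wrong pattern: the strings cannot be parsed, so the result is NA
bad <- as.POSIXct(x, format = "%Y-%m-%d %H:%M", tz = "UTC")

ok
is.na(bad)
```

Checking for NA values after a conversion, for instance with sum(is.na(ts)), is a quick way to confirm that the chosen format fits every row.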

🌷 Note how the format of the time string is specified with date and time codes (see Table-1 below)

Table-1 Date and Time Formatting Codes

Now we can make histograms of this time variable (ts) in the same way as we do for numerics …

par(mfrow=c(1,2), mar=c(4,3,3,1), cex=0.7)
hist(ts,"year",las=2,freq=T,xlab="",main="Event Counts in Years")
hist(ts,"quarter",las=2,freq=T,xlab="",main="Event Counts in Quarters")

🌻 format() can convert dates/times back to strings in the desired format

For example, if we want to count the car theft events by weekday …

table( format(ts,'%w') )


    0     1     2     3     4     5     6 
26316 27397 26791 27416 27319 29284 27118
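The weekday codes are easy to mix up: %w numbers the days 0 to 6 starting from Sunday, while %u numbers them 1 to 7 starting from Monday. A minimal sketch on two made-up time stamps (the name t2 is illustrative; 2012-12-30 was a Sunday and 2012-12-31 a Monday):

```r
# Two invented time stamps: a Sunday and the following Monday
t2 <- as.POSIXct(c("2012-12-30 10:00", "2012-12-31 10:00"), tz = "UTC")

w_code <- format(t2, "%w")   # Sunday = 0, ..., Saturday = 6
u_code <- format(t2, "%u")   # Monday = 1, ..., Sunday = 7
a_name <- format(t2, "%a")   # abbreviated weekday name, depends on the locale

w_code
u_code
a_name
```

In the %w table above, column 0 is therefore Sunday and column 5 is Friday.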

by hours …

table( format(ts,'%H') ) %>% barplot

by 7 days and 24 hours …

table( weekday=format(ts,'%u'), hour=format(ts,'%H') )

       hour
weekday   00   01   02   03   04   05   06   07   08   09   10   11   12   13   14   15
      1 1900  825  712  527  415  542  772 1123 1323 1235  971  737 1129  824  958 1059
      2 1691  777  603  464  414  520  845 1118 1175 1174  948  786 1108  762  908 1071
      3 1814  790  619  469  396  561  862 1140 1329 1237  947  763 1225  804  863 1075
      4 1856  816  696  508  400  534  799 1135 1298 1301  932  731 1093  752  831 1044
      5 1873  932  743  560  473  602  839 1203 1268 1286  938  822 1207  857  937 1140
      6 2050 1267  985  836  652  508  541  650  858 1039  946  789 1204  767  963 1086
      7 2028 1236 1019  838  607  461  478  483  615  864  884  787 1192  789  959 1037
       hour
weekday   16   17   18   19   20   21   22   23
      1 1136 1252 1518 1503 1622 1815 2009 1490
      2 1090 1274 1553 1496 1696 1816 2044 1458
      3 1076 1289 1580 1507 1718 1748 2093 1511
      4 1131 1258 1510 1537 1668 1776 2134 1579
      5 1165 1318 1623 1652 1736 1881 2308 1921
      6 1055 1084 1348 1390 1570 1702 2078 1750
      7 1083 1160 1389 1342 1706 1696 2079 1584

With a few more lines of code we can transform the table into a heatmap that nicely illustrates the time distribution of car theft events. The plotting functions will be covered in detail later in this course.

table(format(ts,"%u"), format(ts,"%H")) %>% 
  as.data.frame() %>% setNames(c("weekday","hour","count")) %>% 
  ggplot(aes(hour,weekday,fill=count)) +
  geom_tile() + coord_fixed(ratio=1) + 
  scale_fill_gradientn(colors=c("seagreen","yellow","darkred"))
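The key step in the pipeline above is as.data.frame(), which turns a two-way table into a "long" data frame with one row per cell, the shape that ggplot expects. A base-R sketch on made-up codes (wd, hr and long are illustrative names, not objects from the analysis):

```r
# Invented weekday and hour codes for five events
wd <- c("1", "1", "2", "2", "2")
hr <- c("00", "23", "00", "00", "23")

# One row per (weekday, hour) cell; the count column is called Freq by default
long <- as.data.frame(table(wd, hr))
long <- setNames(long, c("weekday", "hour", "count"))
long
```

Each row of long can then be drawn as one tile, with count mapped to the fill color.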

💡 Hints: The most useful functions in R are …
■ table()
■ tapply()
in combination with …
■ sort(), head(), tail()
■ sum(), mean(), max(), min()
You should be able to solve all of the questions below.
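As a rough illustration of how these functions combine, here is a sketch on a made-up mini data set. The data frame toy and its values are invented; only the column names mirror D:

```r
# Invented mini data set with the same column names as D
toy <- data.frame(
  Year   = c(2001, 2001, 2001, 2002, 2002, 2003),
  Arrest = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

counts <- table(toy$Year)                      # number of events per year
rates  <- tapply(toy$Arrest, toy$Year, mean)   # arrest rate per year

sort(counts, decreasing = TRUE)                # years ranked by event count
head(sort(rates, decreasing = TRUE), 1)        # year with the highest arrest rate
```

table() counts the rows in each group, while tapply(x, group, f) applies any summary function f to x within each group.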

【2.2】 What is the month and year of the median date in our dataset?

【2.3】 In which month did the fewest motor vehicle thefts occur?

【2.4】 On which weekday did the most motor vehicle thefts occur?

【2.5】 Which month has the largest number of motor vehicle thefts for which an arrest was made?

Section-3 Visualizing Crime Trends

【3.1】 In general …

does it look like crime increases or decreases from 2002 - 2012?
does it look like crime increases or decreases from 2005 - 2008?
does it look like crime increases or decreases from 2009 - 2011?

【3.2】 Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period?

【3.3】 For what proportion of motor vehicle thefts in 2001 was an arrest made?

【3.4】 For what proportion of motor vehicle thefts in 2007 was an arrest made?

【3.5】 For what proportion of motor vehicle thefts in 2012 was an arrest made?

Section-4 Popular Locations

【4.1】 Which locations are the top five locations for motor vehicle thefts, excluding the “Other” category? You should select 5 of the following options.

【4.2】 How many observations are in Top5?

【4.3】 One of the locations has a much higher arrest rate than the other locations. Which is it?

【4.4】 On which day of the week do the most motor vehicle thefts at gas stations happen?

【4.5】 On which day of the week do the fewest motor vehicle thefts in residential driveways happen?

NINJA’s DOJO ● ● ●

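# the six most frequent locations (use the full column name rather than partial matching)
L6 = table(D$LocationDescription) %>% sort %>% tail() 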
D %>% mutate(ts = ts) %>% 
  filter(LocationDescription %in% names(L6)[-4]) %>% 
  mutate(Loc = str_remove(LocationDescription, " .*$")) %>% 
  group_by(Loc, wday=format(ts,"%u"), hour=format(ts,"%H")) %>% 
  summarise(no.events = n(), arrest.rate = mean(Arrest), .groups="drop") %>% 
  group_by(Loc) %>% mutate_at(vars(no.events,arrest.rate), scale) %>% 
  gather("key","value",4:5) %>% 
  ggplot(aes(x=hour, y=wday,fill=value)) + geom_tile(alpha=0.75) +
  scale_fill_gradientn(colors=c("darkgreen","green","yellow","red","darkred")) +
  coord_fixed(ratio=1) + facet_grid(Loc~key) + theme_bw() +
  theme(axis.ticks=element_blank(),axis.text=element_text(size=6)) +
  ggtitle("Vehicle Crimes in a Week")

💡 Quiz:
■ What can you observe from the chart above?
■ If you were in law enforcement, how could the chart help you?
■ If you were a car thief, how could the chart help you?

🌷 Note that we haven’t done any solid analysis yet. Simply by exploring the data with table() and tapply(), we can already draw a lot of useful information from it.