AS3-1: Digital Detective

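Crime is a matter of international concern, but it is recorded and handled differently from country to country. In the United States, the Federal Bureau of Investigation (FBI) records violent crimes and property crimes. In addition, each city keeps its own crime records, and some cities publish data on crime rates. The city of Chicago, Illinois has published crime data online since 2001. Chicago is the third most populous city in the United States, with a population of more than 2.7 million. In this assignment we focus on one particular type of property crime, "motor vehicle theft", and use some basic data analysis in R to understand Chicago's motor vehicle theft records. Use read.csv() to read in the data file data/mvtWeek1.csv.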

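🌻 The fields are defined as follows:

ID: incident ID
Date: date and time
LocationDescription: location of the theft
Arrest: whether an arrest was made (case solved)
Domestic: whether it was a domestic crime
Beat: beat (small area) code
District: district code
CommunityArea: community area code
Year: year
Latitude: latitude
Longitude: longitude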

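🌻 By analyzing these records, we can:

Understand the overall state and trend of motor vehicle theft incidents
Describe how motor vehicle theft incidents are distributed in time and space
Inform the allocation of police resources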

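🌻 The learning objectives of this exercise are:

Converting string and date (time) data
Practicing grouped counts and grouped statistics with table() and tapply()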

Section-1 Loading the Data

Read the data into a data frame D.

D = read.csv("data/mvtWeek1.csv")

As noted earlier, the shorter the name of your main data frame, the better.

【1.1】How many rows of data (observations) are in this dataset?

【1.2】How many variables are in this dataset?

Let’s also check the data types of these variables

and do a quick summary on them.
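The checks above can be sketched on a small stand-in data frame. The data frame `toy` and its values are hypothetical and are not taken from mvtWeek1.csv; the same calls should apply to D.

```r
# Hypothetical stand-in for D (values invented for illustration only)
toy <- data.frame(
  ID     = c(101L, 102L, 103L, 104L),
  Beat   = c(1121L, 522L, 733L, 522L),
  Arrest = c(FALSE, TRUE, FALSE, FALSE),
  LocationDescription = c("STREET", "ALLEY", "STREET", "GAS STATION"),
  stringsAsFactors = FALSE
)

n_obs  <- nrow(toy)   # number of rows (observations)
n_vars <- ncol(toy)   # number of columns (variables)
str(toy)              # data type of each column
summary(toy)          # quick summary of each column

max_id   <- max(toy$ID)     # largest value in a numeric column
min_beat <- min(toy$Beat)   # smallest value in a numeric column
n_alley  <- sum(toy$LocationDescription == "ALLEY")  # count rows matching a value
```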

【1.3】Using the “max” function, what is the maximum value of the variable ID?

【1.4】 What is the minimum value of the variable Beat?

【1.5】 How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?

sum(D$Arrest)

[1] 15536

What is the average arrest rate of car theft events?
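One way to think about the arrest rate: R treats TRUE as 1 and FALSE as 0, so sum() of a logical vector counts the TRUE values and mean() gives their proportion. A minimal sketch on a hypothetical vector (not the real Arrest column):

```r
# Hypothetical logical vector standing in for D$Arrest
arrest <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

n_arrests   <- sum(arrest)    # TRUE counts as 1, so this is the number of arrests
arrest_rate <- mean(arrest)   # share of TRUE values, i.e. the arrest rate

print(n_arrests)
print(arrest_rate)
```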

【1.6】 How many observations have a LocationDescription value of ALLEY?

Section-2 Processing Date and Time in R

【2.1】 In what format are the entries in the variable Date?

Month/Day/Year Hour:Minute
Day/Month/Year Hour:Minute
Hour:Minute Month/Day/Year
Hour:Minute Day/Month/Year

head(D$Date)  # Month/Day/Year Hour:Minute

[1] "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 21:30"
[6] "12/31/12 20:30"

🌞 Dates and times are troublesome. In principle the timeline is a simple linear dimension, but its neat real-number properties get scrambled by human conventions. Months have different numbers of days, and a week may span two months. Worst of all, people record dates and times in so many different formats that computers cannot reliably read them automatically, so dates and times are usually read in as character strings. We have to inspect the strings, recognize the format, and convert them into the Date or POSIXct (a standard date-time class) data type before we can use them properly in our programs. For example, we need to observe the format of D$Date and convert this chr column into a POSIXct vector (ts).

ts = as.POSIXct(D$Date, format="%m/%d/%y %H:%M")

🌻 as.POSIXct(x, format) converts a string vector x into a POSIXct vector.
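A small sketch of why the format string matters, using two hypothetical date strings in the same layout as D$Date. The time zone is fixed to "UTC" here only to keep the example reproducible; that choice is an assumption and not part of the original code.

```r
# Hypothetical strings in Month/Day/Year Hour:Minute layout
x <- c("12/31/12 23:15", "01/05/11 08:30")

# Correct format: month first, then day
good <- as.POSIXct(x, format = "%m/%d/%y %H:%M", tz = "UTC")
class(good)                      # "POSIXct" "POSIXt"
format(good, "%Y-%m-%d %H:%M")

# Wrong format: day and month swapped
bad <- as.POSIXct(x, format = "%d/%m/%y %H:%M", tz = "UTC")
is.na(bad)                       # the first string has no month 31, so it becomes NA
format(bad[2], "%Y-%m-%d")       # the second parses without error, but as May 1
```

The second case is the dangerous one: a wrong format can produce valid-looking but incorrect dates, so it is worth checking a few converted values against the original strings.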

🌷 Note how the format of the time string is specified with date and time codes (see Table-1 below)

Table-1 Date and Time Formatting Codes
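As a quick illustration of the codes used in this assignment, the sketch below formats a single hypothetical timestamp with each code. It covers only the codes that appear in this exercise and is not a reproduction of Table-1; the names in `codes` are made up for readability.

```r
# A hypothetical timestamp: Sunday, 30 December 2012, 23:15 (UTC)
stamp <- as.POSIXct("2012-12-30 23:15", tz = "UTC")

codes <- c(year4 = "%Y", year2 = "%y", month = "%m", day = "%d",
           hour = "%H", minute = "%M",
           wday0 = "%w",   # weekday as 0-6, with 0 = Sunday
           wday1 = "%u")   # weekday as 1-7, with 1 = Monday and 7 = Sunday

# Apply each code to the same timestamp
parts <- sapply(codes, function(code) format(stamp, code))
print(parts)
```

Note that %w and %u number the weekdays differently, which matters when reading the weekday tables later in this section.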

Now we can make histograms of this time variable (ts) in the same way as we do for numerics …

par(mfrow=c(1,2), mar=c(4,3,3,1), cex=0.7)
hist(ts,"year",las=2,freq=TRUE,xlab="",main="Event Counts in Years")
hist(ts,"quarter",las=2,freq=TRUE,xlab="",main="Event Counts in Quarters")

🌻 format() can convert dates and times back to strings in the desired format

For example, to count the car theft events by weekday …

table( format(ts,'%w') )


    0     1     2     3     4     5     6 
26316 27397 26791 27416 27319 29284 27118

by hours …

table( format(ts,'%H') )


   00    01    02    03    04    05    06    07    08    09    10    11    12    13    14 
13212  6643  5377  4202  3357  3728  5136  6852  7866  8136  6566  5415  8158  5555  6419 
   15    16    17    18    19    20    21    22    23 
 7512  7736  8635 10521 10427 11716 12434 14745 11293

by 7 days and 24 hours …

table(weekday=format(ts,'%u'), hour=format(ts,'%H'))

       hour
weekday   00   01   02   03   04   05   06   07   08   09   10   11   12   13   14   15
      1 1900  825  712  527  415  542  772 1123 1323 1235  971  737 1129  824  958 1059
      2 1691  777  603  464  414  520  845 1118 1175 1174  948  786 1108  762  908 1071
      3 1814  790  619  469  396  561  862 1140 1329 1237  947  763 1225  804  863 1075
      4 1856  816  696  508  400  534  799 1135 1298 1301  932  731 1093  752  831 1044
      5 1873  932  743  560  473  602  839 1203 1268 1286  938  822 1207  857  937 1140
      6 2050 1267  985  836  652  508  541  650  858 1039  946  789 1204  767  963 1086
      7 2028 1236 1019  838  607  461  478  483  615  864  884  787 1192  789  959 1037
       hour
weekday   16   17   18   19   20   21   22   23
      1 1136 1252 1518 1503 1622 1815 2009 1490
      2 1090 1274 1553 1496 1696 1816 2044 1458
      3 1076 1289 1580 1507 1718 1748 2093 1511
      4 1131 1258 1510 1537 1668 1776 2134 1579
      5 1165 1318 1623 1652 1736 1881 2308 1921
      6 1055 1084 1348 1390 1570 1702 2078 1750
      7 1083 1160 1389 1342 1706 1696 2079 1584

With a few more lines of code we can transform the table into a heatmap that nicely illustrates the time distribution of car theft events. The plotting functions will be explained later in this course.

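library(dplyr); library(ggplot2)  # provide %>% and ggplot(); assumed to be installed
# Cross-tabulate weekday by hour, reshape to long format, and draw one tile per cell
table(format(ts,"%u"), format(ts,"%H")) %>% 
  as.data.frame() %>% setNames(c("weekday","hour","count")) %>% 
  ggplot(aes(hour,weekday,fill=count)) +
  geom_tile() + coord_fixed(ratio=1) + 
  scale_fill_gradientn(colors=c("seagreen","yellow","darkred"))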

💡 Hints: The most useful functions in R are …
■ table()
■ tapply()
combined with …
■ sort(), head(), tail()
■ sum(), mean(), max(), min()
You should be able to solve all of the questions below.
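The functions in the hint can be combined as in the following sketch. The data frame `toy` and its values are hypothetical; with the real data, the grouping vector would come from format(ts, ...) instead.

```r
# Hypothetical events: month of occurrence and whether an arrest was made
toy <- data.frame(
  month  = c("01", "01", "02", "03", "03", "03"),
  Arrest = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# table(): count events in each group
counts <- table(toy$month)
sort(counts)                          # ascending, so the smallest group comes first
fewest <- names(sort(counts))[1]      # group with the fewest events
most   <- names(which.max(counts))    # group with the most events

# tapply(): apply a function to one column within each group of another
arrests_by_month <- tapply(toy$Arrest, toy$month, sum)    # number of arrests per month
rate_by_month    <- tapply(toy$Arrest, toy$month, mean)   # arrest rate per month
print(arrests_by_month)
print(rate_by_month)
```

In short, table() answers "how many in each group", while tapply() answers "what is some statistic of another column within each group".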

【2.2】 What is the month and year of the median date in our dataset?

【2.3】 In which month did the fewest motor vehicle thefts occur?

【2.4】 On which weekday did the most motor vehicle thefts occur?

【2.5】 Which month has the largest number of motor vehicle thefts for which an arrest was made?

Section-3 Visualizing Crime Trends

【3.1】 In general …

does it look like crime increases or decreases from 2002 - 2012?
does it look like crime increases or decreases from 2005 - 2008?
does it look like crime increases or decreases from 2009 - 2011?

【3.2】 Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period?

【3.3】 For what proportion of motor vehicle thefts in 2001 was an arrest made?

【3.4】 For what proportion of motor vehicle thefts in 2007 was an arrest made?

【3.5】 For what proportion of motor vehicle thefts in 2012 was an arrest made?
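Proportions like these can be computed in more than one way. The sketch below, on hypothetical data, shows a two-way table converted to row proportions with prop.table(), and the equivalent tapply() call.

```r
# Hypothetical events (not the real data): year and arrest flag
toy <- data.frame(
  Year   = c(2001, 2001, 2001, 2001, 2007, 2007, 2012, 2012, 2012, 2012),
  Arrest = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
)

# Two-way count table: rows are years, columns are FALSE / TRUE
tab <- table(toy$Year, toy$Arrest)

# margin = 1 divides each row by its row total, giving within-year proportions
prop <- prop.table(tab, margin = 1)
print(prop[, "TRUE"])     # arrest proportion per year

# The same numbers via tapply(): mean of a logical vector within each year
rate <- tapply(toy$Arrest, toy$Year, mean)
print(rate)
```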

Section-4 Popular Locations

【4.1】 Which locations are the top five locations for motor vehicle thefts, excluding the “Other” category? You should select 5 of the following options.

【4.2】 How many observations are in Top5?

【4.3】 One of the locations has a much higher arrest rate than the other locations. Which is it?

【4.4】 On which day of the week do the most motor vehicle thefts at gas stations happen?

【4.5】 On which day of the week do the fewest motor vehicle thefts in residential driveways happen?
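A possible approach to questions of this kind is to pick the most frequent locations, subset the data to them, and cross-tabulate location by weekday. The sketch below uses hypothetical data; `toy`, `top2` and `sub` are illustrative names, and the `wday` column stands in for format(ts, "%u").

```r
# Hypothetical events: location and weekday code (1 = Monday ... 7 = Sunday)
toy <- data.frame(
  LocationDescription = c("STREET", "STREET", "GAS STATION", "ALLEY",
                          "GAS STATION", "STREET", "OTHER"),
  wday = c("1", "6", "6", "2", "6", "3", "1"),
  stringsAsFactors = FALSE
)

# Names of the two most frequent locations
top2 <- names(sort(table(toy$LocationDescription), decreasing = TRUE))[1:2]

# Keep only the events at those locations
sub <- subset(toy, LocationDescription %in% top2)

# Location by weekday counts
by_day <- table(sub$LocationDescription, sub$wday)
print(by_day)

# Weekday with the most events at one location
busiest_gas_day <- names(which.max(by_day["GAS STATION", ]))
```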

NINJA’s DOJO ● ● ●

library(tidyverse)  # dplyr, tidyr, stringr, ggplot2; assumed to be installed
# The six most frequent locations; names(L6)[-4] below drops the 4th of them
L6 = table(D$LocationDescription) %>% sort %>% tail() 
D %>% mutate(ts = ts) %>% 
  filter(LocationDescription %in% names(L6)[-4]) %>% 
  mutate(Loc = str_remove(LocationDescription, " .*$")) %>% 
  group_by(Loc, wday=format(ts,"%u"), hour=format(ts,"%H")) %>% 
  summarise(no.events = n(), arrest.rate = mean(Arrest), .groups="drop") %>% 
  group_by(Loc) %>% mutate_at(vars(no.events,arrest.rate), scale) %>% 
  gather("key","value",4:5) %>% 
  ggplot(aes(x=hour, y=wday,fill=value)) + geom_tile(alpha=0.75) +
  scale_fill_gradientn(colors=c("darkgreen","green","yellow","red","darkred")) +
  coord_fixed(ratio=1) + facet_grid(Loc~key) + theme_bw() +
  theme(axis.ticks=element_blank(),axis.text=element_text(size=6)) +
  ggtitle("Vehicle Crimes in a Week")

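💡 Strategic implications:
♞ If you were the police, how would you deploy your officers?
♞ And what if you were the thief?

🌷 Using nothing more than table(), tapply() and data visualization, we can already find a great deal of strategically meaningful information in the data.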