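Crime is a matter of international concern, but it is recorded and handled in different ways in different countries.
In the United States, the Federal Bureau of Investigation (FBI) records violent crimes and property crimes.
In addition, every city records crime, and some cities publish data on crime rates.
The city of Chicago, Illinois has published its crime data online from 2001 onward.
Chicago is the third most populous city in the United States, with a population of over 2.7 million.
In this assignment we focus on one specific type of property crime, called "motor vehicle theft",
and we will use some basic data analysis in R to understand Chicago's motor vehicle theft records.
Please use read.csv() to read in the data file data/mvtWeek1.csv.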
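🌻 The fields are defined as follows:
ID : event ID
Date : date and time
LocationDescription : location of the theft
Arrest : whether an arrest was made
Domestic : whether it was a domestic crime
Beat : beat (patrol area) code
District : district code
CommunityArea : community area code
Year : year
Latitude : latitude
Longitude : longitude
🌻 By analysing these records, we can:
🌻 The learning focus of this exercise is:
table() and tapply()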
Read the data into a data frame D.
D = read.csv("data/mvtWeek1.csv")
As I said, the shorter the name of your main data frame, the better.
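As a minimal sketch of what read.csv() returns and how the result can be inspected, the example below writes a tiny CSV to a temporary file and reads it back. The rows, the object name `toy` and the temporary file are invented for illustration; they are not taken from mvtWeek1.csv.

```r
# Toy CSV written to a temporary file; the rows are invented for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Date,Arrest,Beat",
  "1,12/31/12 23:15,FALSE,623",
  "2,12/31/12 22:00,TRUE,1213",
  "3,12/30/12 08:30,FALSE,724"
), tmp)

# stringsAsFactors = FALSE keeps text columns as character strings
toy <- read.csv(tmp, stringsAsFactors = FALSE)

nrow(toy)   # number of rows (observations)
ncol(toy)   # number of columns (variables)
str(toy)    # data type of each column
```

Note that read.csv() recognizes TRUE/FALSE as logical values, while the date column stays a character string until it is converted explicitly.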
【1.1】How many rows of data (observations) are in this dataset?
#
【1.2】How many variables are in this dataset?
#
Let’s also check the data types of these variables
#
and do a quick summary on them.
#
【1.3】Using the “max” function, what is the maximum value of the variable ID?
#
【1.4】 What is the minimum value of the variable Beat?
#
【1.5】 How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?
sum(D$Arrest)
[1] 15536
What is the average arrest rate of car theft events?
#
【1.6】 How many observations have a LocationDescription value of ALLEY?
#
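The questions above rely on the fact that R treats TRUE as 1 and FALSE as 0 in arithmetic. A small sketch on invented vectors (`arrest` and `location` are hypothetical names, not columns of D) shows the idea:

```r
# Invented toy vectors, not the Chicago data.
arrest   <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
location <- c("ALLEY", "STREET", "ALLEY", "GAS STATION", "STREET")

sum(arrest)                # TRUE counts as 1, so this is the number of TRUE values
mean(arrest)               # proportion of TRUE values, i.e. a rate
sum(location == "ALLEY")   # number of entries equal to "ALLEY"
table(location)            # counts for every distinct value at once
```

The same pattern should carry over to a logical or character column of a data frame.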
【2.1】 In what format are the entries in the variable Date?
head(D$Date) # Month/Day/Year Hour:Minute
[1] "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 21:30"
[6] "12/31/12 20:30"
🌞 Dates and times are quite troublesome. By nature, time is a regular,
linear dimension. However, these nice real-number properties get messed
up in human society: months have different numbers of days, and a week
may span two months. Worst of all, people record dates and times in so
many different formats that computers cannot read them automatically.
Therefore, dates and times are usually read in as character strings. We
have to look into the strings manually, recognize the format, and then
convert them into the Date or POSIXct (the name of a standard timing
system) data types before we can use them properly in our programs. For
example, we need to observe the format in D$Date and convert this chr
column into a POSIXct vector (ts).
ts = as.POSIXct(D$Date, format="%m/%d/%y %H:%M")
🌻 as.POSIXct(x, format) converts a string vector x into a POSIXct vector.
🌷 Watch how the format of the time string is specified with date and time codes (see Table-1 below).
Table-1 Date and Time Formatting Codes
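To make the conversion concrete, here is a minimal sketch on three invented strings. The name `ts_toy` is hypothetical, and the `tz = "UTC"` argument is an addition that is not in the call above; it is used here only so the result does not depend on the local time zone.

```r
# Invented strings in the same Month/Day/Year Hour:Minute layout as D$Date
x <- c("12/31/12 23:15", "01/05/01 00:30", "not a date")

# %m = month, %d = day, %y = two-digit year, %H = hour (00-23), %M = minute
ts_toy <- as.POSIXct(x, format = "%m/%d/%y %H:%M", tz = "UTC")

ts_toy           # strings that do not match the format become NA
is.na(ts_toy)    # a quick way to check for failed conversions
class(ts_toy)    # "POSIXct" "POSIXt"
```

Checking `sum(is.na(...))` after a conversion is a cheap way to notice a wrong format string.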
Now we can make histograms of this time variable (ts) in the same way as we do for numerics …
par(mfrow=c(1,2), mar=c(4,3,3,1), cex=0.7)
hist(ts,"year",las=2,freq=TRUE,xlab="",main="Event Counts in Years")
hist(ts,"quarter",las=2,freq=TRUE,xlab="",main="Event Counts in Quarters")
🌻 format() can convert dates/times back to strings in the desired format.
For example, if we want to count the car theft events by weekday …
table( format(ts,'%w') )
0 1 2 3 4 5 6
26316 27397 26791 27416 27319 29284 27118
by hours …
table( format(ts,'%H') )
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14
13212 6643 5377 4202 3357 3728 5136 6852 7866 8136 6566 5415 8158 5555 6419
15 16 17 18 19 20 21 22 23
7512 7736 8635 10521 10427 11716 12434 14745 11293
by 7 days and 24 hours …
table(weekday=format(ts,'%u'), hour=format(ts,'%H'))
hour
weekday 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15
1 1900 825 712 527 415 542 772 1123 1323 1235 971 737 1129 824 958 1059
2 1691 777 603 464 414 520 845 1118 1175 1174 948 786 1108 762 908 1071
3 1814 790 619 469 396 561 862 1140 1329 1237 947 763 1225 804 863 1075
4 1856 816 696 508 400 534 799 1135 1298 1301 932 731 1093 752 831 1044
5 1873 932 743 560 473 602 839 1203 1268 1286 938 822 1207 857 937 1140
6 2050 1267 985 836 652 508 541 650 858 1039 946 789 1204 767 963 1086
7 2028 1236 1019 838 607 461 478 483 615 864 884 787 1192 789 959 1037
hour
weekday 16 17 18 19 20 21 22 23
1 1136 1252 1518 1503 1622 1815 2009 1490
2 1090 1274 1553 1496 1696 1816 2044 1458
3 1076 1289 1580 1507 1718 1748 2093 1511
4 1131 1258 1510 1537 1668 1776 2134 1579
5 1165 1318 1623 1652 1736 1881 2308 1921
6 1055 1084 1348 1390 1570 1702 2078 1750
7 1083 1160 1389 1342 1706 1696 2079 1584
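The two weekday codes used above are easy to confuse: `%w` numbers the days 0 to 6 starting on Sunday, while `%u` numbers them 1 to 7 starting on Monday. A small sketch on three invented timestamps (`ts_toy` is a hypothetical name, and `tz = "UTC"` is an added assumption):

```r
# Three invented timestamps: a Monday, a Sunday and another Monday
ts_toy <- as.POSIXct(
  c("2012-12-31 23:15", "2012-12-30 08:30", "2012-12-24 23:45"),
  format = "%Y-%m-%d %H:%M", tz = "UTC"
)

format(ts_toy, "%u")   # weekday as 1-7, Monday = 1
format(ts_toy, "%w")   # weekday as 0-6, Sunday = 0
format(ts_toy, "%H")   # hour as a two-digit string

# Cross-tabulate weekday by hour, naming the two dimensions
table(weekday = format(ts_toy, "%u"), hour = format(ts_toy, "%H"))
```

Because format() returns strings, the table is ordered alphabetically, which is why zero-padded codes such as `%H` sort correctly.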
With a few more lines of code we can transform the table into a heatmap that nicely illustrates the time distribution of car theft events. The plotting functions will be elaborated later in this course.
library(dplyr); library(ggplot2)  # provide %>% and ggplot(); skip if already loaded
table(format(ts,"%u"), format(ts,"%H")) %>%
as.data.frame() %>% setNames(c("weekday","hour","count")) %>%
ggplot(aes(hour,weekday,fill=count)) +
geom_tile() + coord_fixed(ratio=1) +
scale_fill_gradientn(colors=c("seagreen","yellow","darkred"))
💡 Hints: The most useful functions in R are …
■ table()
■ tapply()
used together with …
■ sort(), head(), tail()
■ sum(), mean(), max(), min()
You should be able to solve all of the questions below.
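As a hedged illustration of how these functions combine, the sketch below uses an invented data frame `toy` (not the Chicago data). `which.max()` is not in the list above; it is added here only as a convenience.

```r
# Invented toy data, not the Chicago records
toy <- data.frame(
  Year   = c(2001, 2001, 2002, 2002, 2002, 2003),
  Arrest = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# table() counts the rows in each group
counts <- table(toy$Year)
sort(counts, decreasing = TRUE)          # largest group first

# tapply(values, groups, function) applies a function within each group
rates <- tapply(toy$Arrest, toy$Year, mean)
rates                                    # share of TRUE per year
head(sort(rates, decreasing = TRUE), 2)  # the two highest rates
names(which.max(rates))                  # label of the group with the highest rate
```

The grouping vector does not have to be a column; an expression such as `format(ts, "%m")` can be used in its place.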
【2.2】 What is the month and year of the median date in our dataset?
#
【2.3】 In which month did the fewest motor vehicle thefts occur?
#
【2.4】 On which weekday did the most motor vehicle thefts occur?
#
【2.5】 Which month has the largest number of motor vehicle thefts for which an arrest was made?
#
【3.1】 In general …
#
【3.2】 Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period?
#
【3.3】 For what proportion of motor vehicle thefts in 2001 was an arrest made?
#
【3.4】 For what proportion of motor vehicle thefts in 2007 was an arrest made?
#
【3.5】 For what proportion of motor vehicle thefts in 2012 was an arrest made?
#
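One possible way to look at proportions per year is a two-way table normalized by row with prop.table(). The sketch below runs on invented data (`toy` is hypothetical); tapply() with mean would be an equally valid route.

```r
# Invented toy data, not the Chicago records
toy <- data.frame(
  Year   = c(2001, 2001, 2001, 2001, 2012, 2012),
  Arrest = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

tab <- table(toy$Year, toy$Arrest)   # rows = years, columns = FALSE / TRUE
tab

# margin = 1 divides each row by its row total, so each row sums to 1
p <- prop.table(tab, margin = 1)
p
p["2001", "TRUE"]                    # share of 2001 rows where Arrest is TRUE
```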
【4.1】 Which locations are the top five locations for motor vehicle thefts, excluding the “Other” category? You should select 5 of the following options.
#
【4.2】 How many observations are in Top5?
#
【4.3】 One of the locations has a much higher arrest rate than the other locations. Which is it?
#
【4.4】 On which day of the week do the most motor vehicle thefts at gas stations happen?
#
【4.5】 On which day of the week do the fewest motor vehicle thefts in residential driveways happen?
#
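The questions in this part work on a subset of the rows. A minimal sketch of that pattern, on an invented data frame (`toy`, `keep`, `sub_toy` and the `wday` column are hypothetical names), could look like this:

```r
# Invented toy data, not the Chicago records
toy <- data.frame(
  LocationDescription = c("STREET", "STREET", "GAS STATION",
                          "ALLEY", "OTHER", "GAS STATION"),
  Arrest = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
  wday   = c("1", "5", "6", "6", "2", "6"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose location is in a chosen set
keep    <- c("STREET", "GAS STATION")
sub_toy <- subset(toy, LocationDescription %in% keep)

nrow(sub_toy)                                            # rows left after filtering
tapply(sub_toy$Arrest, sub_toy$LocationDescription, mean) # rate per location
table(sub_toy$LocationDescription, sub_toy$wday)          # location by weekday
```

If the location column were a factor, the dropped categories would still appear in table() with zero counts; keeping it as character avoids that.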
library(dplyr); library(tidyr); library(stringr); library(ggplot2)  # skip if already loaded
# the six most frequent locations, in ascending order of count
L6 = table(D$LocationDescription) %>% sort %>% tail()
D %>% mutate(ts = ts) %>%
filter(LocationDescription %in% names(L6)[-4]) %>%
mutate(Loc = str_remove(LocationDescription, " .*$")) %>%
group_by(Loc, wday=format(ts,"%u"), hour=format(ts,"%H")) %>%
summarise(no.events = n(), arrest.rate = mean(Arrest), .groups="drop") %>%
group_by(Loc) %>% mutate_at(vars(no.events,arrest.rate), scale) %>%
gather("key","value",4:5) %>%
ggplot(aes(x=hour, y=wday,fill=value)) + geom_tile(alpha=0.75) +
scale_fill_gradientn(colors=c("darkgreen","green","yellow","red","darkred")) +
coord_fixed(ratio=1) + facet_grid(Loc~key) + theme_bw() +
theme(axis.ticks=element_blank(),axis.text=element_text(size=6)) +
ggtitle("Vehicle Crimes in a Week")
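💡 Strategic implications:
♞ If you were the police, how would you deploy your officers?
♞ And what if you were the thief?
🌷 Using nothing more than table(), tapply() and data visualization,
we can already find a lot of strategically meaningful information in the data.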