The file data/mvtWeek1.csv contains records of motor vehicle theft events in the city of Chicago. The columns are defined as follows:
■ ID: a unique identifier for each observation
■ Date: the date the crime occurred
■ LocationDescription: the location where the crime occurred
■ Arrest: whether or not an arrest was made for the crime (TRUE if an arrest was made, and FALSE if an arrest was not made)
■ Domestic: whether or not the crime was a domestic crime, meaning that it was committed against a family member (TRUE if it was domestic, and FALSE if it was not domestic)
■ Beat: the area, or “beat”, in which the crime occurred. This is the smallest regional division defined by the Chicago Police Department.
■ District: the police district in which the crime occurred. Each district is composed of many beats, and the districts are defined by the Chicago Police Department.
■ CommunityArea: the community area in which the crime occurred. Since the 1920s, Chicago has been divided into what are called “community areas”, of which there are now 77. The community areas were devised in an attempt to create socially homogeneous regions.
■ Year: the year in which the crime occurred.
■ Latitude: the latitude of the location at which the crime occurred.
■ Longitude: the longitude of the location at which the crime occurred.

The first question is, and should always be, what can we do with this data?
While we conduct the practical analysis, we’ll also …
■ Date and the standard date-time data type POSIXct
■ table and tapply
Read the data into a data frame D.
D = read.csv("data/mvtWeek1.csv")
As I said, the shorter the name of your main data frame, the better.
【1.1】How many rows of data (observations) are in this dataset?
nrow(D)
[1] 191641
【1.2】How many variables are in this dataset?
ncol(D)
[1] 11
Let’s also check the data types of these variables
str(D)
'data.frame': 191641 obs. of 11 variables:
$ ID : int 8951354 8951141 8952745 8952223 8951608 8950793 8950760 8951611 8951802 8950706 ...
$ Date : chr "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" ...
$ LocationDescription: chr "STREET" "STREET" "RESIDENTIAL YARD (FRONT/BACK)" "STREET" ...
$ Arrest : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
$ Domestic : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Beat : int 623 1213 1622 724 211 2521 423 231 1021 1215 ...
$ District : int 6 12 16 7 2 25 4 2 10 12 ...
$ CommunityArea : int 69 24 11 67 35 19 48 40 29 24 ...
$ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ Latitude : num 41.8 41.9 42 41.8 41.8 ...
$ Longitude : num -87.6 -87.7 -87.8 -87.7 -87.6 ...
and do a quick summary on them.
summary(D)
ID Date LocationDescription Arrest Domestic
Min. :1310022 Length:191641 Length:191641 Mode :logical Mode :logical
1st Qu.:2832144 Class :character Class :character FALSE:176105 FALSE:191226
Median :4762956 Mode :character Mode :character TRUE :15536 TRUE :415
Mean :4968629
3rd Qu.:7201878
Max. :9181151
Beat District CommunityArea Year Latitude
Min. : 111 Min. : 1 Min. : 0 Min. :2001 Min. :41.6
1st Qu.: 722 1st Qu.: 6 1st Qu.:22 1st Qu.:2003 1st Qu.:41.8
Median :1121 Median :10 Median :32 Median :2006 Median :41.9
Mean :1259 Mean :12 Mean :38 Mean :2006 Mean :41.8
3rd Qu.:1733 3rd Qu.:17 3rd Qu.:60 3rd Qu.:2009 3rd Qu.:41.9
Max. :2535 Max. :31 Max. :77 Max. :2012 Max. :42.0
NA's :43056 NA's :24616 NA's :2276
Longitude
Min. :-87.9
1st Qu.:-87.7
Median :-87.7
Mean :-87.7
3rd Qu.:-87.6
Max. :-87.5
NA's :2276
【1.3】 Using the “max” function, what is the maximum value of the variable ID?
【1.4】 What is the minimum value of the variable Beat?
【1.5】 How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?
sum(D$Arrest)
[1] 15536
What is the average arrest rate of car theft events?
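Because TRUE counts as 1 and FALSE as 0, sum() on a logical vector counts the TRUE values and mean() gives their share. The sketch below shows this on a made-up vector (arrest_demo is a hypothetical name, not part of the dataset), and then computes the rate implied by the two counts already printed above (15536 arrests out of 191641 rows).

```r
# Hypothetical toy vector standing in for D$Arrest
arrest_demo <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

n_true <- sum(arrest_demo)    # number of TRUE values
share  <- mean(arrest_demo)   # proportion of TRUE values

# Rate implied by the outputs of sum(D$Arrest) and nrow(D) shown above
rate_from_counts <- 15536 / 191641   # roughly 0.081, i.e. about 8%
```

On the full data, mean(D$Arrest) should give the same rate in one step, provided the column has no missing values (the summary above shows none for Arrest).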
【1.6】 How many observations have a LocationDescription value of ALLEY?
【2.1】 In what format are the entries in the variable Date?
head(D$Date) # Month/Day/Year Hour:Minute
[1] "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 21:30"
[6] "12/31/12 20:30"
🌞 Dates and times are quite troublesome. In principle, the timeline is a regular, linear dimension, but its nice real-number properties get messed up by human conventions. Months have different numbers of days. A week may span two months. Worst of all, people record dates and times in so many different formats that computers cannot always read them automatically. Therefore, dates and times are usually read in as character strings. We have to look into the strings manually, recognize the format, and then convert them into the Date or POSIXct (the name of a standard date-time system) data type before we can properly use them in our programs. For example, we need to observe the format in D$Date and convert this chr column into a POSIXct vector (ts).
ts = as.POSIXct(D$Date, format="%m/%d/%y %H:%M")
🌻 as.POSIXct(x, format) converts a string vector x into a POSIXct vector.
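As a small illustration of the conversion, the sketch below parses three made-up strings written in the same layout as D$Date. The names x_demo, ts_demo and bad_demo are hypothetical, and tz = "UTC" is an assumption added here so the result does not depend on the local time zone; the course code above does not set it.

```r
# Made-up strings in the "month/day/2-digit year hour:minute" layout
x_demo <- c("12/31/12 23:15", "01/05/01 08:30", "07/04/07 12:00")

# %m = month, %d = day, %y = 2-digit year, %H = hour (00-23), %M = minute
ts_demo <- as.POSIXct(x_demo, format = "%m/%d/%y %H:%M", tz = "UTC")

class(ts_demo)   # a POSIXct object, so it can be sorted and subtracted
range(ts_demo)   # earliest and latest time stamps

# A format that does not match the strings gives NA rather than an error,
# so it is worth checking for NAs after every conversion
bad_demo <- as.POSIXct(x_demo, format = "%Y-%m-%d %H:%M", tz = "UTC")
sum(is.na(bad_demo))
```

A quick sum(is.na(ts)) after converting the real column is a cheap way to confirm that the chosen format matched every row.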
🌷 Watch how the format of the time string is specified with the date and time codes (see Table-1 below).
Table-1 Date and Time Formatting Codes
Now we can make histograms of this time variable (ts) in the same way as we do for numerics …
par(mfrow=c(1,2), mar=c(4,3,3,1), cex=0.7)
hist(ts,"year",las=2,freq=TRUE,xlab="",main="Event Counts in Years")
hist(ts,"quarter",las=2,freq=TRUE,xlab="",main="Event Counts in Quarters")
🌻 format() can convert dates and times back to strings in the desired format.
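A minimal sketch of format() on two made-up time stamps (f_demo is a hypothetical name; tz = "UTC" is again an assumption to keep the output independent of the local time zone). It also shows the difference between the two weekday codes used in this section: %w numbers the days 0 to 6 starting with Sunday as 0, while %u numbers them 1 to 7 starting with Monday as 1.

```r
# 2012-12-31 was a Monday, 2012-12-30 a Sunday
f_demo <- as.POSIXct(c("2012-12-31 23:15", "2012-12-30 08:30"), tz = "UTC")

ym   <- format(f_demo, "%Y-%m")  # year and month as text: "2012-12" "2012-12"
hour <- format(f_demo, "%H")     # hour of day: "23" "08"
w    <- format(f_demo, "%w")     # Sunday = 0: "1" "0"
u    <- format(f_demo, "%u")     # Monday = 1: "1" "7"

# The results are plain character vectors, so table() can count them
table(ym)
```

Since the results are strings, they sort alphabetically; zero-padded codes such as %H and %m keep that order the same as the numeric order.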
For example, if we want to count the car theft events by weekday …
table( format(ts,'%w') )
0 1 2 3 4 5 6
26316 27397 26791 27416 27319 29284 27118
by hours …
library(magrittr)   # provides the %>% pipe (also attached by dplyr)
table( format(ts,'%H') ) %>% barplot
by 7 days and 24 hours …
table( weekday=format(ts,'%u'), hour=format(ts,'%H') )
hour
weekday 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15
1 1900 825 712 527 415 542 772 1123 1323 1235 971 737 1129 824 958 1059
2 1691 777 603 464 414 520 845 1118 1175 1174 948 786 1108 762 908 1071
3 1814 790 619 469 396 561 862 1140 1329 1237 947 763 1225 804 863 1075
4 1856 816 696 508 400 534 799 1135 1298 1301 932 731 1093 752 831 1044
5 1873 932 743 560 473 602 839 1203 1268 1286 938 822 1207 857 937 1140
6 2050 1267 985 836 652 508 541 650 858 1039 946 789 1204 767 963 1086
7 2028 1236 1019 838 607 461 478 483 615 864 884 787 1192 789 959 1037
hour
weekday 16 17 18 19 20 21 22 23
1 1136 1252 1518 1503 1622 1815 2009 1490
2 1090 1274 1553 1496 1696 1816 2044 1458
3 1076 1289 1580 1507 1718 1748 2093 1511
4 1131 1258 1510 1537 1668 1776 2134 1579
5 1165 1318 1623 1652 1736 1881 2308 1921
6 1055 1084 1348 1390 1570 1702 2078 1750
7 1083 1160 1389 1342 1706 1696 2079 1584
With a few more lines of code, we can transform the table into a heatmap that nicely illustrates the time distribution of car theft events. The plotting functions will be explained later in this course.
library(ggplot2)   # provides ggplot(), geom_tile(), etc.
table(format(ts,"%u"), format(ts,"%H")) %>%
as.data.frame() %>% setNames(c("weekday","hour","count")) %>%
ggplot(aes(hour,weekday,fill=count)) +
geom_tile() + coord_fixed(ratio=1) +
scale_fill_gradientn(colors=c("seagreen","yellow","darkred"))
💡 Hints: The most useful functions in R are …
■ table()
■ tapply()
in combination with …
■ sort(), head(), tail()
■ sum(), mean(), max(), min()
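The way these functions combine can be sketched on a tiny made-up data frame (toy, with hypothetical columns place and arrest; this is not the crime data): table() counts the rows in each group, tapply() applies a summary function to one column within each group, and sort(), head() or tail() pick out the extremes.

```r
# Made-up data: six events at three kinds of place
toy <- data.frame(
  place  = c("STREET", "ALLEY", "STREET", "GAS STATION", "STREET", "ALLEY"),
  arrest = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Count events per place, largest first, then keep the top two
counts <- sort(table(toy$place), decreasing = TRUE)
top2   <- head(counts, 2)

# Arrest rate per place: the mean of a logical column within each group
rates <- tapply(toy$arrest, toy$place, mean)
sort(rates)   # places ordered from lowest to highest arrest rate
```

The same pattern carries over to the real data by swapping in the relevant columns of D and, where a question asks about time, a format(ts, ...) string as the grouping variable.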
You should be able to solve all of the questions below.
【2.2】 What is the month and year of the median date in our dataset?
#
【2.3】 In which month did the fewest motor vehicle thefts occur?
#
【2.4】 On which weekday did the most motor vehicle thefts occur?
#
【2.5】 Which month has the largest number of motor vehicle thefts for which an arrest was made?
#
【3.1】 In general …
#
【3.2】 Does it look like there were more crimes for which arrests were made in the first half of the time period or the second half of the time period?
#
【3.3】 For what proportion of motor vehicle thefts in 2001 was an arrest made?
#
【3.4】 For what proportion of motor vehicle thefts in 2007 was an arrest made?
#
【3.5】 For what proportion of motor vehicle thefts in 2012 was an arrest made?
#
【4.1】 Which locations are the top five locations for motor vehicle thefts, excluding the “Other” category? You should select 5 of the following options.
#
【4.2】 How many observations are in Top5?
#
【4.3】 One of the locations has a much higher arrest rate than the other locations. Which is it?
#
【4.4】 On which day of the week do the most motor vehicle thefts at gas stations happen?
#
【4.5】 On which day of the week do the fewest motor vehicle thefts in residential driveways happen?
#
library(dplyr)     # %>%, mutate(), filter(), group_by(), summarise()
library(tidyr)     # gather()
library(stringr)   # str_remove()
library(ggplot2)   # plotting
# the six most frequent locations (sorted ascending, so the largest come last)
L6 = table(D$LocationDescription) %>% sort %>% tail()
D %>% mutate(ts = ts) %>%
  filter(LocationDescription %in% names(L6)[-4]) %>%
  mutate(Loc = str_remove(LocationDescription, " .*$")) %>%
  group_by(Loc, wday=format(ts,"%u"), hour=format(ts,"%H")) %>%
  summarise(no.events = n(), arrest.rate = mean(Arrest), .groups="drop") %>%
  group_by(Loc) %>% mutate_at(vars(no.events,arrest.rate), scale) %>%
  gather("key","value",4:5) %>%
  ggplot(aes(x=hour, y=wday,fill=value)) + geom_tile(alpha=0.75) +
  scale_fill_gradientn(colors=c("darkgreen","green","yellow","red","darkred")) +
  coord_fixed(ratio=1) + facet_grid(Loc~key) + theme_bw() +
  theme(axis.ticks=element_blank(),axis.text=element_text(size=6)) +
  ggtitle("Vehicle Crimes in a Week")
💡 Quiz:
■ What can you observe from the chart above?
■ If you were in law enforcement, how could the chart help you?
■ If you were a car thief, how could the chart help you?
🌷 Note that we haven’t done any solid analysis yet. Simply by exploring the data with table() and tapply(), we can already draw a lot of useful information from the data.