The main data is in data/CPSData.csv. We read it into a data frame D.
D = read.csv("data/CPSData.csv")
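In D, each record is one interviewee and each column is one survey item:
PeopleInHousehold: the number of people in the interviewee’s household.
Region: the census region where the interviewee lives.
State: the state where the interviewee lives.
MetroAreaCode: the metropolitan area code, or NA if the interviewee does not live in a metropolitan area. The mapping from codes to metropolitan area names is provided in MetroAreaCodes.csv.
Age: the age of the interviewee, in years. 80 represents people aged 80-84, and 85 represents people aged 85 and older.
Married: the marital status of the interviewee.
Sex: the sex of the interviewee.
Education: the highest level of education obtained by the interviewee.
Race: the race of the interviewee.
Hispanic: whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: a code identifying the interviewee’s country of birth. The mapping from codes to country names is provided in CountryCodes.csv.
Citizenship: the citizenship status of the interviewee.
EmploymentStatus: the employment status of the interviewee.
Industry: the industry of employment (only available if the interviewee is employed).
🌷 The learning focus of this assignment is the concept and handling of missing values (NA).
summary(D)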
PeopleInHousehold  Region             State              MetroAreaCode   Age
Min.   : 1.00      Length:131302      Length:131302      Min.   :10420   Min.   : 0.0
1st Qu.: 2.00      Class :character   Class :character   1st Qu.:21780   1st Qu.:19.0
Median : 3.00      Mode  :character   Mode  :character   Median :34740   Median :39.0
Mean   : 3.28                                            Mean   :35075   Mean   :38.8
3rd Qu.: 4.00                                            3rd Qu.:41860   3rd Qu.:57.0
Max.   :15.00                                            Max.   :79600   Max.   :85.0
                                                         NA's   :34238

Married            Sex                Education          Race
Length:131302      Length:131302      Length:131302      Length:131302
Class :character   Class :character   Class :character   Class :character
Mode  :character   Mode  :character   Mode  :character   Mode  :character

Hispanic        CountryOfBirthCode  Citizenship        EmploymentStatus
Min.   :0.000   Min.   : 57.0       Length:131302      Length:131302
1st Qu.:0.000   1st Qu.: 57.0       Class :character   Class :character
Median :0.000   Median : 57.0       Mode  :character   Mode  :character
Mean   :0.139   Mean   : 82.7
3rd Qu.:0.000   3rd Qu.: 57.0
Max.   :1.000   Max.   :555.0

Industry
Length:131302
Class :character
Mode  :character
In the summary we can see that there are 34,238 NA’s in MetroAreaCode. Yet be careful …
🌷 summary() does not show the number of NA’s in character columns.
The best way to examine NA’s in a data frame is to pipe is.na() into colSums().
is.na(D) %>% colSums
PeopleInHousehold   Region              State               MetroAreaCode
0                   0                   0                   34238
Age                 Married             Sex                 Education
0                   25338               0                   25338
Race                Hispanic            CountryOfBirthCode  Citizenship
0                   0                   0                   0
EmploymentStatus    Industry
25789               65060
Now we see that there are also NA’s in Married, Education, EmploymentStatus and Industry. In a census, NA may occur when question items are …
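The difference between summary() and is.na() described above can be reproduced on a small example. This is only a sketch: the data frame toy and its values are made up for illustration and are not taken from CPSData.csv.

```r
# Toy data frame (hypothetical values, not from CPSData.csv)
toy <- data.frame(
  Age     = c(10, 42, NA, 67),
  Married = c(NA, "Married", "Divorced", NA),
  stringsAsFactors = FALSE
)

# summary() reports an NA's row for the numeric column only;
# the character column shows just Length / Class / Mode
summary(toy)

# is.na() returns a logical matrix with one column per variable;
# colSums() counts the TRUE values in each column
na_counts <- colSums(is.na(toy))
na_counts
```

Base R is enough here; the pipe form is.na(toy) %>% colSums gives the same counts when a package providing %>% is loaded.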
§ 1.1 How many interviewees are in the dataset?
#
#
§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.
#
#
§ 1.3 Which state has the fewest interviewees?
#
#
Which state has the largest number of interviewees?
#
#
§ 1.4 What proportion of interviewees are citizens of the United States?
#
#
§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity?
#
#
§ 2.1 Which variables have at least one interviewee with a missing (NA) value?
#
#
§ 2.2 Which statement below is the most accurate?
#
#
🌻 Married is not applicable for interviewees who are younger than 15 years old. This type of NA occurs systematically; it is not random.
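One way to check whether missing values follow a rule like this is to cross-tabulate the rule against is.na(). The sketch below assumes a made-up data frame toy; the ages and marital status labels are illustrative only.

```r
# Toy data (hypothetical): Married is missing exactly for ages under 15
toy <- data.frame(
  Age     = c(3, 9, 14, 15, 30, 52),
  Married = c(NA, NA, NA, "Never Married", "Married", "Widowed"),
  stringsAsFactors = FALSE
)

under15    <- toy$Age < 15          # the suspected rule
na_married <- is.na(toy$Married)    # where the value is missing

# If the NA's are systematic, the off-diagonal cells are zero
tab <- table(Under15 = under15, MarriedIsNA = na_married)
tab
```

If the NA’s were random instead, the counts would be spread over all four cells of the table.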
§ 2.3 How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).
#
#
How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.
#
#
§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?
#
#
§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?
#
#
Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?
#
#
🌻 As we can see in the questions above, NA’s are not completely useless. Sometimes they carry useful information.
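As an illustration of how an NA can carry information, the share of missing codes per group can be computed by taking the mean of is.na() within each group. This is a sketch on a made-up data frame toy with hypothetical state labels "A", "B" and "C"; it is not the solution to the questions above.

```r
# Toy data (hypothetical): a missing code stands for "non-metropolitan"
toy <- data.frame(
  State         = c("A", "A", "B", "B", "B", "C"),
  MetroAreaCode = c(NA, NA, 10420, NA, 21780, 34740)
)

# The mean of a logical vector is the proportion of TRUE values,
# so this is the proportion of missing codes within each state
non_metro <- tapply(is.na(toy$MetroAreaCode), toy$State, mean)
non_metro

# States where every record is missing, or none is
names(non_metro)[non_metro == 1]
names(non_metro)[non_metro == 0]
```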
In this exercise, we have two more data files:
data/MetroAreaCodes.csv maps D$MetroAreaCode to the names of the metropolitan areas.
data/CountryCodes.csv maps D$CountryOfBirthCode to the names of the countries.
metro = read.csv("data/MetroAreaCodes.csv")
country = read.csv("data/CountryCodes.csv")
§ 3.1 How many observations (codes for metropolitan areas) are there in metro?
#
#
How many observations (codes for countries) are there in country?
#
#
🌻 merge(x, y, by.x, by.y, all.x, all.y) merges two data frames x and y.
by.x and by.y specify the merging keys in x and y respectively.
all.x and all.y specify whether all rows in x and y should be kept. The defaults are FALSE, which means that rows without a match are removed. When set to TRUE, the unmatched rows stay, with the merged columns set to NA.
§ 3.2 Merge the MetroArea variable into the D data frame by merging keys MetroAreaCode and Code. What is the name of the variable that was added to the data frame by the merge() operation?
D = merge(D, metro, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
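The effect of all.x can be seen on a pair of small data frames. This is a sketch only: x, y and their columns (Code, Person, Id, Area) are hypothetical names invented for the example, not part of the CPS data.

```r
# Toy data frames (hypothetical): x holds codes, y maps codes to names
x <- data.frame(Code = c(1, 2, NA), Person = c("p1", "p2", "p3"))
y <- data.frame(Id = c(1, 3), Area = c("Area One", "Area Three"))

# Default (all.x = FALSE): rows of x without a match in y are dropped
inner <- merge(x, y, by.x = "Code", by.y = "Id")

# all.x = TRUE: every row of x is kept; unmatched rows get Area = NA
left <- merge(x, y, by.x = "Code", by.y = "Id", all.x = TRUE)

nrow(inner)
nrow(left)
sum(is.na(left$Area))
```

Note that the key column keeps its name from x (Code here), and that merge() sorts the result by the key by default, so the row order of the merged data frame may differ from the original.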
How many interviewees have a missing value for the new metropolitan area variable?
#
#
§ 3.3 Which of the following metropolitan areas has the largest number of interviewees?
#
#
§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?
#
#
§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.
#
#
🌻 Most mathematical and summary functions, such as sum() and mean(), return NA if there is any NA in their input vector(s). As a remedy, the na.rm=TRUE argument excludes the NA’s from the input before calculation.
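A minimal sketch of this behaviour, first on a plain vector and then through tapply(), which passes extra arguments such as na.rm on to the function it applies. The data frame toy and its columns Group and Value are made up for illustration.

```r
x <- c(1, NA, 3)
plain   <- mean(x)                # NA, because one input value is missing
cleaned <- mean(x, na.rm = TRUE)  # NA's are dropped first: (1 + 3) / 2

# Toy data (hypothetical): group "a" contains one missing value
toy <- data.frame(
  Group = c("a", "a", "b", "b"),
  Value = c(1, NA, 0, 1)
)

# Without na.rm, any group containing an NA yields NA
with_na <- tapply(toy$Value, toy$Group, mean)

# Extra arguments to tapply() are forwarded to mean()
without_na <- tapply(toy$Value, toy$Group, mean, na.rm = TRUE)

with_na
without_na
```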
§ 3.6 Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.
#
#
§ 4.1 Merge the Country variable into the D data frame by merging keys CountryOfBirthCode and Code. What is the name of the variable added to the data frame by this merge operation?
D = merge(D, country, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)
How many interviewees have a missing value for the new country variable?
#
#
§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?
#
#
§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?
#
#
§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?
#
#
In Brazil?
#
#
In Somalia?
#
#