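The main data is in data/CPSData.csv. We read it
into a data frame D.
D = read.csv("data/CPSData.csv")
Each record in D is one interviewee, and each column is one item of the survey questionnaire:
PeopleInHousehold: the number of people in the interviewee's household.
Region: the census region where the interviewee lives.
State: the state where the interviewee lives.
MetroAreaCode: the code of the metropolitan area where the interviewee lives; NA if the interviewee does not live in a metropolitan area. The mapping from codes to metropolitan area names is provided in MetroAreaCodes.csv.
Age: the interviewee's age in years. 80 represents people aged 80-84, and 85 represents people aged 85 and older.
Married: the interviewee's marital status.
Sex: the interviewee's sex.
Education: the highest level of education the interviewee has obtained.
Race: the interviewee's race.
Hispanic: whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: a code identifying the interviewee's country of birth. The mapping from codes to country names is provided in CountryCodes.csv.
Citizenship: the interviewee's citizenship status.
EmploymentStatus: the interviewee's employment status.
Industry: the interviewee's industry of employment (only available if employed).
🌷 The learning focus of this assignment is the concept and handling of missing values (NA).
summary(D)
 PeopleInHousehold    Region              State          MetroAreaCode        Age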
 Min.   : 1.00     Length:131302      Length:131302      Min.   :10420   Min.   : 0.0
 1st Qu.: 2.00     Class :character   Class :character   1st Qu.:21780   1st Qu.:19.0
 Median : 3.00     Mode  :character   Mode  :character   Median :34740   Median :39.0
 Mean   : 3.28                                           Mean   :35075   Mean   :38.8
 3rd Qu.: 4.00                                           3rd Qu.:41860   3rd Qu.:57.0
 Max.   :15.00                                           Max.   :79600   Max.   :85.0
                                                         NA's   :34238
    Married              Sex             Education             Race
 Length:131302      Length:131302      Length:131302      Length:131302
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
    Hispanic     CountryOfBirthCode Citizenship        EmploymentStatus
 Min.   :0.000   Min.   : 57.0      Length:131302      Length:131302
 1st Qu.:0.000   1st Qu.: 57.0      Class :character   Class :character
 Median :0.000   Median : 57.0      Mode  :character   Mode  :character
 Mean   :0.139   Mean   : 82.7
 3rd Qu.:0.000   3rd Qu.: 57.0
 Max.   :1.000   Max.   :555.0
   Industry
 Length:131302
 Class :character
 Mode  :character
In the summary we can see that there are 34,238
NA’s in MetroAreaCode. Yet be careful …
🌷 summary() does not show the number of NA’s
in the character columns.
A more reliable way to examine the NA’s in a data frame is to pipe
is.na() into colSums().
is.na(D) %>% colSums
 PeopleInHousehold             Region              State      MetroAreaCode
                 0                  0                  0              34238
               Age            Married                Sex          Education
                 0              25338                  0              25338
              Race           Hispanic CountryOfBirthCode        Citizenship
                 0                  0                  0                  0
  EmploymentStatus           Industry
             25789              65060
Now we see that there are also NA’s in Married,
Education, EmploymentStatus and
Industry. In a census, NA’s may occur
when question items are …
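The difference between summary() and colSums(is.na()) is easier to see on a small example. The sketch below uses a made-up data frame named toy (an assumption for illustration, not the CPS file) and base R only, so it does not depend on the pipe operator.

```r
# Made-up toy data (not the CPS file): one numeric and one character column,
# each containing NA's.
toy <- data.frame(
  Age     = c(34, NA, 12, 58),
  Married = c("Married", NA, NA, "Divorced"),
  stringsAsFactors = FALSE
)

# summary() reports the NA count for the numeric column only;
# the character column shows just Length, Class and Mode.
print(summary(toy))

# is.na() returns a logical matrix; colSums() counts the TRUEs in each column,
# whatever the column type.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Writing colSums(is.na(toy)) is equivalent to the piped form is.na(toy) %>% colSums used above.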
§ 1.1 How many interviewees are in the dataset?
#
#
§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly as you see it.
#
#
§ 1.3 Which state has the fewest interviewees?
#
#
Which state has the largest number of interviewees?
#
#
§ 1.4 What proportion of interviewees are citizens of the United States?
#
#
§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity?
#
#
§ 2.1 Which variables have at least one interviewee with a missing (NA) value?
#
#
§ 2.2 Which statement below is the most accurate:
#
#
🌻 Married is not applicable to interviewees who are younger than 15
years old. This type of NA occurs systematically; it is
not random.
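One way to check whether NA's are systematic is to cross-tabulate the missingness against the variable suspected of driving it. The following is a minimal sketch on a made-up data frame (toy and its values are assumptions, not CPS records).

```r
# Made-up toy data (not the CPS file).
toy <- data.frame(
  Age     = c(3, 10, 14, 15, 30, 62),
  Married = c(NA, NA, NA, "Never Married", "Married", "Widowed"),
  stringsAsFactors = FALSE
)

# Cross-tabulate "younger than 15" against "Married is NA".
# If the NA's are systematic, the counts pile up on the diagonal.
under15_vs_na <- table(Under15 = toy$Age < 15, MarriedNA = is.na(toy$Married))
print(under15_vs_na)
```

In this toy table every interviewee under 15 has a missing Married value and nobody else does, which is the pattern to look for in the real data.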
§ 2.3 How many states had all interviewees living in
a non-metropolitan area (that is, they have a missing
MetroAreaCode value)? For this question, treat the District
of Columbia as a state (even though it is not technically a state).
#
#
How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.
#
#
§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?
#
#
§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?
#
#
Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?
#
#
🌻 As we can see in the questions above, NA’s are not
completely useless. Sometimes they carry useful information.
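To illustrate how an NA can be used as information, the sketch below treats a missing MetroAreaCode as "non-metropolitan" and computes the share of such interviewees per state. The data frame toy and its state labels are made up for illustration.

```r
# Made-up toy data (not the CPS file); NA in MetroAreaCode stands for
# "does not live in a metropolitan area".
toy <- data.frame(
  State         = c("A", "A", "B", "B", "B", "C"),
  MetroAreaCode = c(NA, NA, 1, NA, 1, 2),
  stringsAsFactors = FALSE
)

# The mean of a logical vector is the proportion of TRUEs, so this gives
# the share of non-metropolitan interviewees in each state.
non_metro <- tapply(is.na(toy$MetroAreaCode), toy$State, mean)
print(sort(non_metro))
```

A proportion of 1 means every interviewee in that state is non-metropolitan, and a proportion of 0 means every interviewee lives in a metropolitan area.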
In this exercise, we have two more data files:
data/MetroAreaCodes.csv maps
D$MetroAreaCode to the names of the metropolitan areas
data/CountryCodes.csv maps D$CountryOfBirthCode
to the names of the countries
metro = read.csv("data/MetroAreaCodes.csv")
country = read.csv("data/CountryCodes.csv")
§ 3.1 How many observations (codes for metropolitan
areas) are there in metro?
#
#
How many observations (codes for countries) are there in country?
#
#
🌻 merge(x, y, by.x, by.y, all.x, all.y) merges two data
frames x and y.
by.x and by.y specify the merging keys in
x and y respectively.
all.x and all.y specify whether all rows
in x and y should be kept. The defaults are
FALSE, which means that rows without a match are dropped.
When set to TRUE, the unmatched rows are kept, with the merged
columns set to NA.
§ 3.2 Merge the MetroArea variable into
the D data frame by merging keys MetroAreaCode
and Code. What is the name of the variable that was added
to the data frame by the merge() operation?
D = merge(D, metro, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
How many interviewees have a missing value for the new metropolitan area variable?
#
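The effect of all.x in merge() can be sketched on two small made-up data frames. The names people and codes, and the codes and area names in them, are assumptions for illustration and are not taken from the CPS files.

```r
# Made-up toy data (not the CPS files).
people <- data.frame(Id = 1:4, MetroAreaCode = c(1, NA, 2, 9))
codes  <- data.frame(Code = c(1, 2),
                     MetroArea = c("Metro A", "Metro B"),
                     stringsAsFactors = FALSE)

# Default (all.x = FALSE): rows of people without a matching Code are dropped.
inner <- merge(people, codes, by.x = "MetroAreaCode", by.y = "Code")

# all.x = TRUE: every row of people is kept; MetroArea is NA where no match exists.
left <- merge(people, codes, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)

print(nrow(inner))
print(left)
```

Note that merge() sorts the result by the key column by default, so the row order of the merged data frame may differ from the original.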
#
§ 3.3 Which of the following metropolitan areas has the largest number of interviewees?
#
#
§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?
#
#
§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.
#
#
🌻 Most mathematical and logical summary functions, such as sum() and mean(), return NA if
there is any NA in their input vector(s). As a remedy, the
na.rm=TRUE argument excludes the NA’s from
the input before the calculation.
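As a small illustration of na.rm=TRUE, the sketch below computes a proportion with and without it, and then passes it through tapply(), which forwards extra arguments to the function it applies. The data frame toy and its labels are made up for illustration.

```r
# Made-up toy data (not the CPS file).
toy <- data.frame(
  MetroArea = c("Metro A", "Metro A", "Metro A", "Metro B", "Metro B"),
  Education = c("No high school diploma", NA, "Bachelor's degree", "High school", NA),
  stringsAsFactors = FALSE
)

# Comparing against NA yields NA, so this logical vector contains NA's too.
no_diploma <- toy$Education == "No high school diploma"

overall_raw   <- mean(no_diploma)                # NA, because of the missing values
overall_clean <- mean(no_diploma, na.rm = TRUE)  # NA's dropped before averaging

# tapply() passes na.rm = TRUE on to mean() for every group.
prop <- tapply(no_diploma, toy$MetroArea, mean, na.rm = TRUE)
print(prop)
```

With na.rm=TRUE the denominator counts only the interviewees whose Education is reported, which is worth keeping in mind when interpreting the proportions.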
§ 3.6 Passing na.rm=TRUE to the
tapply function, determine which metropolitan area has the
smallest proportion of interviewees who have received no high school
diploma.
#
#
§ 4.1 Merge the Country variable into
the D data frame by merging keys
CountryOfBirthCode and Code. What is the name
of the variable added to the data frame by this merge operation?
D = merge(D, country, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)
How many interviewees have a missing value for the new country variable?
#
#
§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?
#
#
§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?
#
#
§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?
#
#
In Brazil?
#
#
In Somalia?
#
#