In this notebook we use United States census data sets to learn about NA (data that is not available) and how to handle it.
The main data is in data/CPSData.csv. We read it into a data frame D.
D = read.csv("data/CPSData.csv")
This is census data. Each record represents one respondent, and each column captures the respondent’s answer to an item in the census questionnaire. There are 14 columns in D …
PeopleInHousehold: The number of people in the interviewee’s household.
Region: The census region where the interviewee lives.
State: The state where the interviewee lives.
MetroAreaCode: A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area). The mapping from codes to names of metropolitan areas is provided in the file MetroAreaCodes.csv.
Age: The age, in years, of the interviewee. 80 represents people aged 80-84, and 85 represents people aged 85 and higher.
Married: The marriage status of the interviewee.
Sex: The sex of the interviewee.
Education: The maximum level of education obtained by the interviewee.
Race: The race of the interviewee.
Hispanic: Whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: A code identifying the country of birth of the interviewee. The mapping from codes to names of countries is provided in the file CountryCodes.csv.
Citizenship: The United States citizenship status of the interviewee.
EmploymentStatus: The status of employment of the interviewee.
Industry: The industry of employment of the interviewee (only available if they are employed).
summary(D)
PeopleInHousehold  Region             State              MetroAreaCode    Age
Min.   : 1.00      Length:131302      Length:131302      Min.   :10420    Min.   : 0.0
1st Qu.: 2.00      Class :character   Class :character   1st Qu.:21780    1st Qu.:19.0
Median : 3.00      Mode  :character   Mode  :character   Median :34740    Median :39.0
Mean   : 3.28                                            Mean   :35075    Mean   :38.8
3rd Qu.: 4.00                                            3rd Qu.:41860    3rd Qu.:57.0
Max.   :15.00                                            Max.   :79600    Max.   :85.0
                                                         NA's   :34238
Married            Sex                Education          Race
Length:131302      Length:131302      Length:131302      Length:131302
Class :character   Class :character   Class :character   Class :character
Mode  :character   Mode  :character   Mode  :character   Mode  :character
Hispanic        CountryOfBirthCode  Citizenship        EmploymentStatus
Min.   :0.000   Min.   : 57.0       Length:131302      Length:131302
1st Qu.:0.000   1st Qu.: 57.0       Class :character   Class :character
Median :0.000   Median : 57.0       Mode  :character   Mode  :character
Mean   :0.139   Mean   : 82.7
3rd Qu.:0.000   3rd Qu.: 57.0
Max.   :1.000   Max.   :555.0
Industry
Length:131302
Class :character
Mode  :character
In the summary we can see that there are 34,238 NA’s in MetroAreaCode. Yet be careful …
🌷 summary() does not show the number of NA’s in the character columns.
A reliable way to count the NA’s in every column of a data frame is to pipe is.na() into colSums().
is.na(D) %>% colSums
 PeopleInHousehold             Region              State      MetroAreaCode
                 0                  0                  0              34238
               Age            Married                Sex          Education
                 0              25338                  0              25338
              Race           Hispanic CountryOfBirthCode        Citizenship
                 0                  0                  0                  0
  EmploymentStatus           Industry
             25789              65060
Now we see that there are also NA’s in Married, Education, EmploymentStatus and Industry. In the case of a census, NA may occur when question items are …
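The difference between summary() and colSums(is.na()) can be reproduced on a small data frame. The sketch below uses a made-up data frame called toy (an assumption for illustration, not the CPS data) and plain function nesting instead of %>%, so it runs without extra packages.

```r
# Toy data frame (made up for illustration; not the CPS data).
toy <- data.frame(
  Age     = c(10, 34, NA, 61),
  Married = c(NA, "Married", "Divorced", NA),
  State   = c("Ohio", "Utah", "Ohio", "Iowa"),
  stringsAsFactors = FALSE
)

# summary() reports NA's for the numeric column only;
# the character columns show just Length/Class/Mode.
print(summary(toy))

# colSums(is.na()) counts NA's in every column, whatever its type.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Names of the columns that contain at least one NA.
na_columns <- names(na_counts)[na_counts > 0]
print(na_columns)
```

With the pipe loaded, `is.na(toy) %>% colSums` gives the same counts as the nested call.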
§ 1.1 How many interviewees are in the dataset?
#
#
§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.
#
#
§ 1.3 Which state has the fewest interviewees?
#
#
Which state has the largest number of interviewees?
#
#
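Questions such as "most common" or "fewest" usually come down to counting categories and sorting the counts. The sketch below shows one possible approach on a made-up vector named industry (hypothetical values, not the CPS data); note how table() treats NA.

```r
# Made-up industry labels (not the CPS data); NA marks "no value reported".
industry <- c("Trade", "Health", NA, "Trade", "Mining", NA, "Trade", "Health")

# table() drops NA by default, so only reported values are counted.
counts <- table(industry)

# Sort the counts in increasing order; the two ends of the sorted
# table give the rarest and the most common value.
sorted_counts <- sort(counts)
least_common <- names(sorted_counts)[1]
most_common  <- names(sorted_counts)[length(sorted_counts)]
print(sorted_counts)

# useNA = "ifany" keeps the NA's as a category of their own.
counts_with_na <- table(industry, useNA = "ifany")
print(counts_with_na)
```

Because NA is dropped by default, the counts already answer "among those with a value reported"; add useNA only when the missing values themselves should be counted.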
§ 1.4 What proportion of interviewees are citizens of the United States?
#
#
§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity?
#
#
§ 2.1 Which variables have at least one interviewee with a missing (NA) value?
#
#
§ 2.2 Which statement below is the most accurate?
#
#
🌻 Married is not applicable to interviewees who are younger than 15 years old. This type of NA occurs systematically; it is not random.
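One way to check whether missing values follow a pattern is to cross-tabulate is.na() of one variable against a condition on another. The sketch below uses a made-up data frame toy (hypothetical ages and marriage labels, not the CPS data) in which Married is missing exactly for those under 15.

```r
# Made-up records (not the CPS data): Married is NA for everyone under 15.
toy <- data.frame(
  Age     = c(3, 9, 14, 15, 27, 44, 70),
  Married = c(NA, NA, NA, "Never Married", "Married", "Divorced", "Widowed"),
  stringsAsFactors = FALSE
)

# Cross-tabulate "is under 15" against "Married is missing".
# If the NA's are systematic, the off-diagonal cells are zero.
na_by_age <- table(Under15 = toy$Age < 15, MarriedMissing = is.na(toy$Married))
print(na_by_age)
```

If the NA's were random, missing values would be spread over both rows of the table instead of being confined to one.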
§ 2.3 How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).
#
#
How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.
#
#
§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?
#
#
§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?
#
#
Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?
#
#
🌻 As we can see in the questions above, NA’s are not completely useless. Sometimes they carry useful information.
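Treating "is missing" as a variable of its own makes that information usable. The sketch below is one possible approach on a made-up data frame toy (hypothetical states A, B, C and codes, not the CPS data): it computes the share of missing MetroAreaCode values per group with tapply().

```r
# Made-up records (not the CPS data); NA in MetroAreaCode means "non-metropolitan".
toy <- data.frame(
  State         = c("A", "A", "B", "B", "B", "C", "C", "C", "C", "C"),
  MetroAreaCode = c(NA, NA, 101, 102, 101, NA, 103, 103, NA, 104),
  stringsAsFactors = FALSE
)

# is.na() gives TRUE/FALSE; the mean of a logical vector is the proportion of TRUE.
non_metro <- tapply(is.na(toy$MetroAreaCode), toy$State, mean)
print(non_metro)

# Groups where every record is non-metropolitan, and groups where none is.
all_non_metro <- names(non_metro)[non_metro == 1]
all_metro     <- names(non_metro)[non_metro == 0]

# Group whose non-metropolitan proportion is closest to a target such as 0.30.
closest_to_30 <- names(which.min(abs(non_metro - 0.30)))
print(closest_to_30)
```

Swapping toy$State for another grouping column gives the same kind of proportion per region.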
In this exercise, we have two more data files:
data/MetroAreaCodes.csv maps D$MetroAreaCode to the names of the metropolitan areas.
data/CountryCodes.csv maps D$CountryOfBirthCode to the names of the countries.
metro = read.csv("data/MetroAreaCodes.csv")
country = read.csv("data/CountryCodes.csv")
§ 3.1 How many observations (codes for metropolitan areas) are there in MetroAreaMap (read above as metro)?
#
#
How many observations (codes for countries) are there in CountryMap (read above as country)?
#
#
🌻 merge(x, y, by.x, by.y, all.x, all.y) merges two data frames x and y.
by.x and by.y specify the merging keys in x and y respectively.
all.x and all.y specify whether all rows in x and y should be kept. The defaults are FALSE, which implies that rows that do not match are removed. When set to TRUE, the unmatched rows stay, with the merged columns set to NA.
§ 3.2 What is the name of the variable that was added to the data frame by the merge() operation?
D = merge(D, metro, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
How many interviewees have a missing value for the new metropolitan area variable?
#
#
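The effect of all.x in merge() can be seen on a small example. The sketch below uses two made-up data frames, people and code_map (hypothetical names, codes and area names, not the CPS files), and compares the default merge with all.x = TRUE.

```r
# Made-up data (not the CPS files): people with a metro code, and a code-to-name map.
people <- data.frame(
  Id            = 1:5,
  MetroAreaCode = c(101, 102, NA, 101, 999)
)
code_map <- data.frame(
  Code      = c(101, 102),
  MetroArea = c("Alpha City", "Beta Town"),
  stringsAsFactors = FALSE
)

# Default (all.x = FALSE): rows of people without a matching Code are dropped.
inner <- merge(people, code_map, by.x = "MetroAreaCode", by.y = "Code")

# all.x = TRUE: every row of people is kept; unmatched rows get NA in MetroArea.
left <- merge(people, code_map, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)

print(nrow(inner))
print(nrow(left))

# Rows whose code was missing or not found in code_map.
print(sum(is.na(left$MetroArea)))
```

Both the row with an NA code and the row with an unknown code (999) end up with NA in the merged column, so counting NA's after the merge does not distinguish the two cases.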
§ 3.3 Which of the following metropolitan areas has the largest number of interviewees?
#
#
§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?
#
#
§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.
#
#
🌻 Most summary functions, such as sum(), mean() and max(), return NA if there is any NA in their input vector(s). As a remedy, the na.rm=TRUE argument excludes the NA’s from the input before the calculation.
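The behaviour of na.rm is easy to verify on a short vector. The sketch below uses made-up vectors no_diploma and metro_area (hypothetical values, not the CPS data) and shows how extra arguments given to tapply() are passed on to the summary function.

```r
# Made-up values (not the CPS data).
no_diploma <- c(TRUE, NA, FALSE, FALSE, NA, TRUE, FALSE, FALSE)
metro_area <- c("Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta", "Gamma")

# A single NA makes the plain mean NA.
overall_raw <- mean(no_diploma)

# na.rm = TRUE drops the NA's first, so the mean is taken over the 6 known values.
overall <- mean(no_diploma, na.rm = TRUE)

# tapply() passes extra arguments on to the function, so na.rm works per group too.
without_na_rm <- tapply(no_diploma, metro_area, mean)
with_na_rm    <- tapply(no_diploma, metro_area, mean, na.rm = TRUE)
print(without_na_rm)
print(with_na_rm)

# Group with the smallest proportion once NA's are excluded.
smallest <- names(which.min(with_na_rm))
print(smallest)
```

Keep in mind that na.rm = TRUE changes the denominator: each proportion is computed over the records with a known value only.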
§ 3.6 Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.
#
#
§ 4.1 What is the name of the variable added to the CPS data frame by this merge operation?
#
#
How many interviewees have a missing value for the new country of birth variable?
#
#
§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?
#
#
§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?
#
#
§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?
#
#
In Brazil?
#
#
In Somalia?
#
#