In this notebook we will use United States census data sets to learn

The main data set is in data/CPSData.csv. We read it into a data frame D.

D = read.csv("data/CPSData.csv")

This is census data. Each record represents one respondent, and each column captures the respondent’s answer to an item in the census questionnaire. There are 14 columns in D.


summary(D)
 PeopleInHousehold    Region             State           MetroAreaCode        Age      
 Min.   : 1.00     Length:131302      Length:131302      Min.   :10420   Min.   : 0.0  
 1st Qu.: 2.00     Class :character   Class :character   1st Qu.:21780   1st Qu.:19.0  
 Median : 3.00     Mode  :character   Mode  :character   Median :34740   Median :39.0  
 Mean   : 3.28                                           Mean   :35075   Mean   :38.8  
 3rd Qu.: 4.00                                           3rd Qu.:41860   3rd Qu.:57.0  
 Max.   :15.00                                           Max.   :79600   Max.   :85.0  
                                                         NA's   :34238                 
   Married              Sex             Education             Race          
 Length:131302      Length:131302      Length:131302      Length:131302     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
    Hispanic     CountryOfBirthCode Citizenship        EmploymentStatus  
 Min.   :0.000   Min.   : 57.0      Length:131302      Length:131302     
 1st Qu.:0.000   1st Qu.: 57.0      Class :character   Class :character  
 Median :0.000   Median : 57.0      Mode  :character   Mode  :character  
 Mean   :0.139   Mean   : 82.7                                           
 3rd Qu.:0.000   3rd Qu.: 57.0                                           
 Max.   :1.000   Max.   :555.0                                           
                                                                         
   Industry        
 Length:131302     
 Class :character  
 Mode  :character  
                   
                   
                   
                   

In the summary we can see that there are 34,238 NA’s in MetroAreaCode. Yet be careful …

🌷 summary() does not show the number of NA’s in the character columns.

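A convenient way to count the NA’s in every column of a data frame is to pipe is.na() into colSums(). The %>% operator comes from the magrittr package, which is also attached when dplyr is loaded.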

is.na(D) %>% colSums
 PeopleInHousehold             Region              State      MetroAreaCode 
                 0                  0                  0              34238 
               Age            Married                Sex          Education 
                 0              25338                  0              25338 
              Race           Hispanic CountryOfBirthCode        Citizenship 
                 0                  0                  0                  0 
  EmploymentStatus           Industry 
             25789              65060 
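
The same count can be obtained in base R without the pipe. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not part of the CPS data); it counts the NA’s per column and then lists only the columns that contain any:

```r
# Hypothetical toy data: one numeric and one character column, each with NA's
toy <- data.frame(
  Age     = c(10, 25, NA, 40),
  Married = c(NA, "Married", "Single", NA),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUE's per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the names of the columns that contain at least one NA
cols_with_na <- names(na_counts)[na_counts > 0]
print(cols_with_na)
```

On the real data, replacing `toy` with D should reproduce the counts shown above.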

Now we see that there are also NA’s in Married, Education, EmploymentStatus and Industry. In the case of a census, NA may occur when question items are …



Section-1 Loading and Summarizing the Dataset

§ 1.1 How many interviewees are in the dataset?

#
#

§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.

#
#

§ 1.3 Which state has the fewest interviewees?

#
#

Which state has the largest number of interviewees?

#
#

§ 1.4 What proportion of interviewees are citizens of the United States?

#
#

§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity?

#
#



Section-2 Evaluating Missing Values

§ 2.1 Which variables have at least one interviewee with a missing (NA) value?

#
#

§ 2.2 Which of the statements below is the most accurate?

#
#

🌻 Married is not applicable to interviewees who are younger than 15 years old. This type of NA occurs systematically, not at random.
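
One way to check whether missing values are systematic is to cross-tabulate a condition on another variable against is.na(). The following is a minimal sketch on made-up data (the `toy` data frame is hypothetical; on the real data the same call would presumably use D$Age and D$Married):

```r
# Hypothetical toy data: Married is NA exactly for respondents younger than 15
toy <- data.frame(
  Age     = c(3, 9, 14, 15, 30, 62),
  Married = c(NA, NA, NA, "Never Married", "Married", "Widowed"),
  stringsAsFactors = FALSE
)

# Rows: is the respondent under 15?  Columns: is Married missing?
tab <- table(Under15 = toy$Age < 15, MarriedNA = is.na(toy$Married))
print(tab)
```

If all the missing values fall into a single row of the table, the NA’s are tied to that condition rather than scattered at random.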

§ 2.3 How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).

#
#

How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.

#
#

§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?

#
#

§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?

#
#

Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?

#
#

🌻 As we can see in the questions above, NA’s are not completely useless. Sometimes they carry useful information.
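
Because the mean of a logical vector is the proportion of TRUE values, is.na() combined with tapply() turns the NA’s into a per-group proportion. The following is a minimal sketch on made-up data (the `toy` data frame and its state labels are hypothetical):

```r
# Hypothetical toy data: NA in MetroAreaCode means "non-metropolitan"
toy <- data.frame(
  State         = c("A", "A", "A", "A", "B", "B", "C", "C"),
  MetroAreaCode = c(10420, NA, 10420, 21780, NA, NA, 34740, 34740),
  stringsAsFactors = FALSE
)

# The mean of a logical vector is the proportion of TRUE's,
# so this gives the share of non-metropolitan respondents per state
non_metro <- tapply(is.na(toy$MetroAreaCode), toy$State, mean)
print(sort(non_metro))
```

A proportion of 1 marks a group in which every respondent is non-metropolitan, and 0 marks a group in which none is.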



Section-3 Integrating Metropolitan Area Data

In this exercise, we have two more data files:

metro = read.csv("data/MetroAreaCodes.csv")
country = read.csv("data/CountryCodes.csv")

§ 3.1 How many observations (codes for metropolitan areas) are there in metro?

# 
#

How many observations (codes for countries) are there in country?

#
#

🌻 merge(x, y, by.x, by.y, all.x, all.y) merges two data frames x and y
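
To illustrate the arguments, here is a minimal sketch on made-up data (`people`, `lookup` and their values are hypothetical). by.x and by.y name the key column in x and y, and all.x=TRUE keeps every row of x even when it has no match in y:

```r
# Hypothetical toy data: respondents with an area code, and a code-to-name lookup
people <- data.frame(
  Id            = 1:4,
  MetroAreaCode = c(100, 200, NA, 300)
)
lookup <- data.frame(
  Code      = c(100, 200),
  MetroArea = c("Alpha City", "Beta Town"),
  stringsAsFactors = FALSE
)

# by.x / by.y name the key column in each data frame;
# all.x = TRUE keeps every row of `people`, even those without a match in `lookup`
merged <- merge(people, lookup, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)
print(merged)

# Unmatched rows (code 300 and the NA code) get NA in the new column
print(sum(is.na(merged$MetroArea)))
```

Without all.x=TRUE the two unmatched rows would be dropped from the result, because merge() keeps only matching rows by default.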

§ 3.2 What is the name of the variable that was added to the data frame by the merge() operation?

D = merge(D, metro, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)

How many interviewees have a missing value for the new metropolitan area variable?

#
#

§ 3.3 Which of the following metropolitan areas has the largest number of interviewees?

#
#

§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?

#
#

§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.

#
#

🌻 Many summary functions, such as sum(), mean() and max(), return NA if there is any NA in their input vector(s). As a remedy, the na.rm=TRUE argument excludes the NA’s from the input before the calculation.
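
A minimal sketch of the effect on a made-up vector (`x` and `group` are hypothetical). It also shows that extra arguments given to tapply() are passed on to the function applied within each group:

```r
# Hypothetical toy data with a missing value
x <- c(2, 4, NA, 6)

print(mean(x))                # NA, because one input is missing
print(mean(x, na.rm = TRUE))  # the NA is dropped before averaging

# Extra arguments to tapply() are passed on to the function applied per group
group <- c("a", "a", "b", "b")
by_group <- tapply(x, group, mean, na.rm = TRUE)
print(by_group)
```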

§ 3.6 Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.

#
#



Section-4 Integrating Country of Birth Data

§ 4.1 What is the name of the variable added to the CPS data frame by this merge operation?

#
#

How many interviewees have a missing value for the new country of birth variable?

#
#

§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?

#
#

§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?

#
#

§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?

#
#

In Brazil?

#
#

In Somalia?

#
#