AS2-3: Census Data

In this notebook we will use United States census data sets to learn

the concept of NA, the marker R uses when data is not available
the troublesome effects of NA and how to handle them
merging data frames, which may introduce NA
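
As a minimal sketch of the first point, using a made-up vector rather than the census data: a comparison with NA is itself NA, so missing values have to be detected with is.na().

```r
# Toy vector (hypothetical values, not taken from the census data)
age = c(25, NA, 40)

is.na(age)                  # TRUE where the value is missing
NA == 1                     # a comparison with NA yields NA, not TRUE or FALSE
age == NA                   # so this cannot be used to find NAs; use is.na()
n_missing = sum(is.na(age)) # number of missing values
n_missing
```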

The main data set is in data/CPSData.csv. We read it into a data frame D.

D = read.csv("data/CPSData.csv")

This is a census data set. Each record represents one respondent. Each column captures the respondent’s answer to an item in the census questionnaire. There are 14 columns in D …

PeopleInHousehold: The number of people in the interviewee’s household.
Region: The census region where the interviewee lives.
State: The state where the interviewee lives.
MetroAreaCode: A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area). The mapping from codes to names of metropolitan areas is provided in the file MetroAreaCodes.csv.
Age: The age, in years, of the interviewee. 80 represents people aged 80-84, and 85 represents people aged 85 and higher.
Married: The marriage status of the interviewee.
Sex: The sex of the interviewee.
Education: The maximum level of education obtained by the interviewee.
Race: The race of the interviewee.
Hispanic: Whether the interviewee is of Hispanic ethnicity.
CountryOfBirthCode: A code identifying the country of birth of the interviewee. The mapping from codes to names of countries is provided in the file CountryCodes.csv.
Citizenship: The United States citizenship status of the interviewee.
EmploymentStatus: The status of employment of the interviewee.
Industry: The industry of employment of the interviewee (only available if they are employed).

summary(D)

 PeopleInHousehold    Region             State           MetroAreaCode        Age      
 Min.   : 1.00     Length:131302      Length:131302      Min.   :10420   Min.   : 0.0  
 1st Qu.: 2.00     Class :character   Class :character   1st Qu.:21780   1st Qu.:19.0  
 Median : 3.00     Mode  :character   Mode  :character   Median :34740   Median :39.0  
 Mean   : 3.28                                           Mean   :35075   Mean   :38.8  
 3rd Qu.: 4.00                                           3rd Qu.:41860   3rd Qu.:57.0  
 Max.   :15.00                                           Max.   :79600   Max.   :85.0  
                                                         NA's   :34238                 
   Married              Sex             Education             Race          
 Length:131302      Length:131302      Length:131302      Length:131302     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
    Hispanic     CountryOfBirthCode Citizenship        EmploymentStatus  
 Min.   :0.000   Min.   : 57.0      Length:131302      Length:131302     
 1st Qu.:0.000   1st Qu.: 57.0      Class :character   Class :character  
 Median :0.000   Median : 57.0      Mode  :character   Mode  :character  
 Mean   :0.139   Mean   : 82.7                                           
 3rd Qu.:0.000   3rd Qu.: 57.0                                           
 Max.   :1.000   Max.   :555.0                                           
                                                                         
   Industry        
 Length:131302     
 Class :character  
 Mode  :character

In the summary we can see that there are 34,238 NA’s in MetroAreaCode. Yet be careful …

🌷 summary() does not show the number of NA’s in the character columns.

A convenient way to count the NA’s in every column of a data frame is to pipe is.na() into colSums(). The %>% pipe is provided by the magrittr package, which dplyr also loads.

is.na(D) %>% colSums

 PeopleInHousehold             Region              State      MetroAreaCode 
                 0                  0                  0              34238 
               Age            Married                Sex          Education 
                 0              25338                  0              25338 
              Race           Hispanic CountryOfBirthCode        Citizenship 
                 0                  0                  0                  0 
  EmploymentStatus           Industry 
             25789              65060

Now we see that there are also NA’s in Married, Education, EmploymentStatus and Industry. In the case of a census, NA may occur when question items are …

not answered
answered improperly
not applicable to the respondents
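
The difference between summary() and colSums(is.na()) can be reproduced on a toy data frame. The values below are made up for illustration, and stringsAsFactors = FALSE is set explicitly so that Married is a character column on any R version.

```r
# Toy data frame (hypothetical values): one numeric and one character column,
# each containing exactly one NA
toy = data.frame(
  Age     = c(30, NA, 12),
  Married = c("Married", "Divorced", NA),
  stringsAsFactors = FALSE
)

summary(toy)                     # reports NA's for Age, but not for the character column Married
na_counts = colSums(is.na(toy))  # counts the NAs in every column, whatever its type
na_counts
```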

Section-1 Loading and Summarizing the Dataset

§ 1.1 How many interviewees are in the dataset?

#
#

§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.

#
#

§ 1.3 Which state has the fewest interviewees?

#
#

Which state has the largest number of interviewees?

#
#

§ 1.4 What proportion of interviewees are citizens of the United States?

#
#

§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity?

#
#

Section-2 Evaluating Missing Values

§ 2.1 Which variables have at least one interviewee with a missing (NA) value?

#
#

§ 2.2 Which statement below is the most accurate?

The Married variable being missing is related to the Region value for the interviewee.
The Married variable being missing is related to the Sex value for the interviewee.
The Married variable being missing is related to the Age value for the interviewee.
The Married variable being missing is related to the Citizenship value for the interviewee.
The Married variable being missing is not related to the Region, Sex, Age, or Citizenship value for the interviewee.

#
#

🌻 Married is not applicable to interviewees who are younger than 15 years old. This type of NA occurs systematically; it is not random.
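
One way to check whether missingness is related to another variable is to cross-tabulate that variable against is.na(). The sketch below uses a made-up data frame in which Married is NA for everyone younger than 15; the category labels are illustrative, not copied from the census data.

```r
# Toy data (hypothetical values): Married is NA for everyone younger than 15
toy = data.frame(
  Age     = c(3, 10, 14, 15, 30, 52),
  Married = c(NA, NA, NA, "Never Married", "Married", "Divorced"),
  Sex     = c("Male", "Female", "Male", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Rows: the variable of interest; columns: whether Married is missing
table(toy$Sex, is.na(toy$Married))             # NAs appear for both sexes
by_age = table(toy$Age < 15, is.na(toy$Married))
by_age                                         # NAs line up exactly with Age < 15
```

If the counts in the TRUE column are concentrated in particular rows, the missingness is systematic with respect to that variable.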

§ 2.3 How many states had all interviewees living in a non-metropolitan area (that is, they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).

#
#

How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.

#
#

§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?

#
#

§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?

#
#

Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?

#
#

🌻 As we can see in the questions above, NA’s are not completely useless. Sometimes they carry useful information.
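
A small sketch of how an NA indicator can be turned into a per-group proportion. The state names and codes below are hypothetical; the idea is that the mean of a logical vector is the share of TRUE values.

```r
# Toy data (hypothetical states and codes); NA means "not in a metropolitan area"
toy = data.frame(
  State         = c("A", "A", "A", "A", "B", "B"),
  MetroAreaCode = c(10420, NA, NA, 21780, NA, NA),
  stringsAsFactors = FALSE
)

# mean() of a logical vector is the proportion of TRUE values,
# so this gives the share of non-metropolitan rows in each state
nonmetro = tapply(is.na(toy$MetroAreaCode), toy$State, mean)
sort(nonmetro)
```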

Section-3 Integrating Metropolitan Area Data

In this exercise, we have two more data files

data/MetroAreaCodes.csv maps D$MetroAreaCode to the names of the metropolitan areas
data/CountryCodes.csv maps D$CountryOfBirthCode to the names of the countries

metro = read.csv("data/MetroAreaCodes.csv")
country = read.csv("data/CountryCodes.csv")

§ 3.1 How many observations (codes for metropolitan areas) are there in metro?

# 
#

How many observations (codes for countries) are there in country?

#
#

🌻 merge(x, y, by.x, by.y, all.x, all.y) merges two data frames x and y

by.x and by.y specify the merging keys in x and y respectively.
all.x and all.y specify whether all rows in x and y should be kept. Both default to FALSE, which means rows without a match are dropped. When set to TRUE, the unmatched rows are kept and the merged columns are set to NA.
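
The effect of all.x can be seen on two toy tables. The data frames people and lookup, and all codes and names in them, are hypothetical and only mimic the shape of the census data and its code table.

```r
# Toy tables (hypothetical codes and names)
people = data.frame(Id = 1:4, MetroAreaCode = c(100, 200, NA, 999))
lookup = data.frame(Code = c(100, 200), MetroArea = c("Alpha", "Beta"),
                    stringsAsFactors = FALSE)

# Default: rows of people without a matching Code are dropped
inner = merge(people, lookup, by.x = "MetroAreaCode", by.y = "Code")

# all.x = TRUE: every row of people is kept
left = merge(people, lookup, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)

nrow(inner)                  # only the rows whose code appears in lookup
nrow(left)                   # same number of rows as people
sum(is.na(left$MetroArea))   # unmatched rows get NA in the merged column
```

Note that merge() sorts the result by the key column by default, so the row order of the merged data frame may differ from the original.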

§ 3.2 What is the name of the variable that was added to the data frame by the merge() operation?

D = merge(D, metro, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)

How many interviewees have a missing value for the new metropolitan area variable?

#
#

§ 3.3 Which of the following metropolitan areas has the largest number of interviewees?

#
#

§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?

#
#

§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.

#
#

🌻 Most arithmetic and summary functions, such as sum(), mean() and max(), return NA if there is any NA in their input vector(s). As a remedy, the na.rm=TRUE argument excludes the NA’s from the input before the calculation.
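
A minimal illustration of na.rm on a made-up vector, including how the argument is passed through tapply() to the summary function:

```r
x = c(1, NA, 3)   # toy vector (hypothetical values)
g = c("a", "a", "b")

mean(x)                 # NA, because one input is missing
mean(x, na.rm = TRUE)   # mean of the non-missing values only

# Extra arguments to tapply() are passed on to the summary function,
# so na.rm = TRUE reaches mean() for every group
group_means = tapply(x, g, mean, na.rm = TRUE)
group_means
```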

§ 3.6 Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.

#
#

Section-4 Integrating Country of Birth Data

§ 4.1 What is the name of the variable added to the CPS data frame by this merge operation?

#
#

How many interviewees have a missing value for the new country of birth variable?

#
#

§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?

#
#

§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?

#
#

§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?

#
#

In Brazil?

#
#

In Somalia?

#
#

Group-00

2021-10-02 17:36:22
