AS2-3: U.S. Census Data

The main data set is in data/CPSData.csv. We read it into a data frame D.

D = read.csv("data/CPSData.csv")

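Each record in D is one respondent, and each column is one item of the survey questionnaire:

PeopleInHousehold: The number of people in the respondent's household.
Region: The census region where the respondent lives.
State: The state where the respondent lives.
MetroAreaCode: The metropolitan area code; NA if the respondent does not live in a metropolitan area. The mapping from codes to metropolitan area names is provided in MetroAreaCodes.csv.
Age: The respondent's age in years. 80 represents people aged 80-84, and 85 represents people aged 85 and over.
Married: The respondent's marital status.
Sex: The respondent's sex.
Education: The highest level of education the respondent has attained.
Race: The respondent's race.
Hispanic: Whether the respondent is of Hispanic ethnicity.
CountryOfBirthCode: A code identifying the respondent's country of birth. The mapping from codes to country names is provided in CountryCodes.csv.
Citizenship: The respondent's citizenship status.
EmploymentStatus: The respondent's employment status.
Industry: The respondent's industry of employment (available only if employed).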

🌷 The learning objectives of this assignment are:

The concept of missing values (NA) and how to handle them
Merging data frames

summary(D)

 PeopleInHousehold    Region             State           MetroAreaCode        Age      
 Min.   : 1.00     Length:131302      Length:131302      Min.   :10420   Min.   : 0.0  
 1st Qu.: 2.00     Class :character   Class :character   1st Qu.:21780   1st Qu.:19.0  
 Median : 3.00     Mode  :character   Mode  :character   Median :34740   Median :39.0  
 Mean   : 3.28                                           Mean   :35075   Mean   :38.8  
 3rd Qu.: 4.00                                           3rd Qu.:41860   3rd Qu.:57.0  
 Max.   :15.00                                           Max.   :79600   Max.   :85.0  
                                                         NA's   :34238                 
   Married              Sex             Education             Race          
 Length:131302      Length:131302      Length:131302      Length:131302     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
    Hispanic     CountryOfBirthCode Citizenship        EmploymentStatus  
 Min.   :0.000   Min.   : 57.0      Length:131302      Length:131302     
 1st Qu.:0.000   1st Qu.: 57.0      Class :character   Class :character  
 Median :0.000   Median : 57.0      Mode  :character   Mode  :character  
 Mean   :0.139   Mean   : 82.7                                           
 3rd Qu.:0.000   3rd Qu.: 57.0                                           
 Max.   :1.000   Max.   :555.0                                           
                                                                         
   Industry        
 Length:131302     
 Class :character  
 Mode  :character

The summary shows 34,238 NAs in MetroAreaCode. But be careful …

🌷 summary() does not show the number of NAs in character columns.

A reliable way to count the NAs in every column of a data frame is to pipe is.na() into colSums(). The %>% operator used here is not part of base R; it requires a package that provides it, such as magrittr or dplyr.

is.na(D) %>% colSums

 PeopleInHousehold             Region              State      MetroAreaCode 
                 0                  0                  0              34238 
               Age            Married                Sex          Education 
                 0              25338                  0              25338 
              Race           Hispanic CountryOfBirthCode        Citizenship 
                 0                  0                  0                  0 
  EmploymentStatus           Industry 
             25789              65060

Now we see that there are also NAs in Married, Education, EmploymentStatus and Industry. In a census, NAs may occur when question items are …

not answered
answered improperly
not applicable to the respondents
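
The NA-counting idiom above can be tried on a small example. The sketch below uses a hypothetical toy data frame (not the CPS file) with one numeric and one character column, and base R only, so no pipe package is needed. It illustrates that summary() reports NAs for the numeric column but not for the character column, while colSums(is.na()) counts both.

```r
# Hypothetical toy data (an assumption for illustration, not the CPS data).
toy <- data.frame(
  MetroAreaCode = c(1, NA, 2, NA, 3),
  Married       = c("Married", NA, NA, NA, "Divorced"),
  stringsAsFactors = FALSE
)

# summary() lists NA's for the numeric column only;
# the character column shows just Length / Class / Mode.
summary(toy)

# is.na() returns a logical matrix with the same shape as the data frame;
# colSums() then counts the TRUE values in each column.
na_counts <- colSums(is.na(toy))
na_counts
```

Written with the pipe, the last step is the same computation as is.na(toy) %>% colSums.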

Section-1 Loading and Summarizing the Dataset

§ 1.1 How many interviewees are in the dataset?

#
#

§ 1.2 Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.

#
#

§ 1.3 Which state has the fewest interviewees?

#
#

Which state has the largest number of interviewees?

#
#

§ 1.4 What proportion of interviewees are citizens of the United States?

#
#

§ 1.5 For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity?

#
#

Section-2 Evaluating Missing Values

§ 2.1 Which variables have at least one interviewee with a missing (NA) value?

#
#

§ 2.2 Which statement below is the most accurate?

The Married variable being missing is related to the Region value for the interviewee.
The Married variable being missing is related to the Sex value for the interviewee.
The Married variable being missing is related to the Age value for the interviewee.
The Married variable being missing is related to the Citizenship value for the interviewee.
The Married variable being missing is not related to the Region, Sex, Age, or Citizenship value for the interviewee.

#
#

🌻 Married is not applicable to interviewees who are younger than 15 years old. NAs of this type occur systematically; they are not random.
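
One way to check whether missingness is systematic is to cross-tabulate "is the value missing?" against a candidate explanatory variable. The sketch below does this on a hypothetical toy data frame (the rows are made up for illustration); the age threshold of 15 comes from the note above.

```r
# Hypothetical toy data (an assumption for illustration, not the CPS data).
toy <- data.frame(
  Age     = c(3, 10, 14, 15, 30, 62),
  Married = c(NA, NA, NA, "Never Married", "Married", "Widowed"),
  stringsAsFactors = FALSE
)

# Rows: is Married missing?  Columns: is the respondent under 15?
missing_by_age <- table(
  MarriedMissing = is.na(toy$Married),
  Under15        = toy$Age < 15
)
missing_by_age
```

If the NAs were random, the counts would be spread over all four cells. Here they all fall on the diagonal, which is the pattern a systematic NA produces.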

§ 2.3 How many states had all interviewees living in a non-metropolitan area (i.e., they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).

#
#

How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.

#
#

§ 2.4 Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?

#
#

§ 2.5 Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?

#
#

Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?

#
#

🌻 As we can see in the questions above, NA’s are not completely useless. Sometimes they carry useful information.
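
A minimal sketch of how an NA can be used as information: because the mean of a logical vector is the proportion of TRUE values, applying mean() to is.na() within each group gives the share of missing values per group. The data frame below is a hypothetical toy example with made-up state labels and codes, not the CPS data.

```r
# Hypothetical toy data (an assumption for illustration, not the CPS data).
toy <- data.frame(
  State         = c("A", "A", "A", "B", "B", "C"),
  MetroAreaCode = c(NA, NA, 1, 2, 2, NA)
)

# Share of rows with a missing MetroAreaCode (non-metropolitan) in each state.
non_metro <- tapply(is.na(toy$MetroAreaCode), toy$State, mean)
sort(non_metro)
```

In this toy example a value of 1 marks a state whose rows are all non-metropolitan, and a value of 0 marks a state whose rows are all metropolitan.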

Section-3 Integrating Metropolitan Area Data

In this exercise, we have two more data files:

data/MetroAreaCodes.csv maps D$MetroAreaCode to the names of the metropolitan areas
data/CountryCodes.csv maps D$CountryOfBirthCode to the names of the countries

metro = read.csv("data/MetroAreaCodes.csv")
country = read.csv("data/CountryCodes.csv")

§ 3.1 How many observations (codes for metropolitan areas) are there in metro?

# 
#

How many observations (codes for countries) are there in country?

#
#

🌻 merge(x, y, by.x, by.y, all.x, all.y) merges two data frames x and y

by.x and by.y specify the merging keys in x and y respectively.
all.x and all.y specify whether all rows in x and y should be kept. Both default to FALSE, which means that rows without a match in the other data frame are dropped. When set to TRUE, unmatched rows are kept and the columns coming from the other data frame are set to NA.
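
The effect of all.x can be seen on a small example. The sketch below uses two hypothetical toy data frames (the names, codes and area labels are made up for illustration) and compares the default merge with all.x=TRUE.

```r
# Hypothetical toy data (an assumption for illustration, not the CPS data).
people <- data.frame(
  Name          = c("p1", "p2", "p3"),
  MetroAreaCode = c(1, NA, 3)
)
codes <- data.frame(
  Code      = c(1, 2),
  MetroArea = c("Area X", "Area Y"),
  stringsAsFactors = FALSE
)

# Default (all.x = FALSE): only rows whose key is found in both tables survive.
inner <- merge(people, codes, by.x = "MetroAreaCode", by.y = "Code")

# all.x = TRUE: every row of people is kept; unmatched rows get NA in MetroArea.
left <- merge(people, codes, by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)

nrow(inner)
nrow(left)
sum(is.na(left$MetroArea))
```

Keeping all.x=TRUE preserves the number of rows of the left-hand data frame, which is why the merges in this assignment use it.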

§ 3.2 Merge the MetroArea variable into the D data frame by merging keys MetroAreaCode and Code. What is the name of the variable that was added to the data frame by the merge() operation?

D = merge(D, metro, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)

How many interviewees have a missing value for the new metropolitan area variable?

#
#

§ 3.3 Which of the following metropolitan areas has the largest number of interviewees?

#
#

§ 3.4 Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity?

#
#

§ 3.5 Determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.

#
#

🌻 Many mathematical and summary functions, such as sum() and mean(), return NA if there is any NA in their input vector(s). As a remedy, the na.rm=TRUE argument excludes the NAs from the input before the calculation.
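
A short sketch of the na.rm behaviour described above, first on a plain vector and then through tapply(), which forwards extra arguments such as na.rm=TRUE to the function it applies. The data frame and its NoDiploma column are hypothetical and made up for illustration.

```r
# A single NA makes the plain mean NA; na.rm = TRUE drops it first.
x <- c(1, NA, 3)
plain_mean <- mean(x)
clean_mean <- mean(x, na.rm = TRUE)

# Hypothetical toy data (an assumption for illustration, not the CPS data).
toy <- data.frame(
  MetroArea = c("Area X", "Area X", "Area Y", "Area Y"),
  NoDiploma = c(TRUE, NA, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Without na.rm, any group containing an NA yields NA.
with_na <- tapply(toy$NoDiploma, toy$MetroArea, mean)

# Extra arguments to tapply() are passed on to mean().
without_na <- tapply(toy$NoDiploma, toy$MetroArea, mean, na.rm = TRUE)

with_na
without_na
```

Note that with na.rm=TRUE the proportion is computed over the non-missing rows only, so the denominator shrinks for groups that contain NAs.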

§ 3.6 Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.

#
#

Section-4 Integrating Country of Birth Data

§ 4.1 Merge the Country variable into the D data frame by merging keys CountryOfBirthCode and Code. What is the name of the variable added to the data frame by this merge operation?

D = merge(D, country, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)

How many interviewees have a missing value for the new country variable?

#
#

§ 4.2 Among all interviewees born outside of North America, which country was the most common place of birth?

#
#

§ 4.3 What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States?

#
#

§ 4.4 Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India?

#
#

In Brazil?

#
#

In Somalia?

#
#


Group-00

2024-02-28 11:57:26
