Tidying the CDP Data - Part 3

This post

  • merges summary of public responses saved in df_summary.csv for the emissions years 2010-2017 and all requests saved in df_responses.csv into df_cdp.csv
  • copy to SQL database
library(pacman)
p_load(tidyverse)

read df_summary.csv and df_responses.csv into R

df_summary  <- read_csv("~/Dropbox/data/cdp/investor/derived-data/df_summary.csv")
df_summary
## # A tibble: 14,551 x 6
##    account  year company                  country      sector           response
##      <dbl> <dbl> <chr>                    <chr>        <chr>            <chr>   
##  1    1800  2010 Bharat Petroleum Corpor… India        Energy           public  
##  2    5300  2010 EDP - Energias de Portu… Portugal     Utilities        public  
##  3   29900  2010 Ernst & Young LLP UK     United King… Banks            public  
##  4   10200  2010 Kizoo AG                 Germany      Information Tec… public  
##  5   22300  2010 Kungsleden AB            Sweden       Banks            public  
##  6   13400  2010 Nordea Bank              Sweden       Banks            public  
##  7   15400  2010 QLogic Corp.             USA          Information Tec… public  
##  8   32500  2010 Railway Procurement Age… Ireland      N/A              public  
##  9   23100  2010 Royal Wessanen NV        Netherlands  Consumer Staples public  
## 10   17900  2010 Sumitomo Metal Industri… Japan        Materials        public  
## # … with 14,541 more rows
df_responses <- read_csv("~/Dropbox/data/cdp/investor/derived-data/df_responses.csv")
df_responses
## # A tibble: 52,478 x 5
##    account  year company        country     industry                            
##      <dbl> <dbl> <chr>          <chr>       <chr>                               
##  1      44  2010 3i Group       United Kin… Banks, Diverse Financials, Insurance
##  2      46  2010 3i Infrastruc… Channel Is… Banks, Diverse Financials, Insurance
##  3     285  2010 3M Company     USA         Consumer Durables, Household and Pe…
##  4      30  2010 888 Holdings   Gibraltar   Hotels, Restaurants & Leisure, and …
##  5   22750  2010 A & J Mucklow… United Kin… Real Estate                         
##  6    1535  2010 A.G. Barr Plc  United Kin… Food & Beverage Processing          
##  7     276  2010 A.P. Moller -… Denmark     Water Transportation - Water Transp…
##  8      87  2010 A2A            Italy       Electric Utilities & Independent Po…
##  9   28813  2010 A4e            United Kin… Education Services                  
## 10      94  2010 Aalberts Indu… Netherlands Electrical Equipment and Machinery  
## # … with 52,468 more rows

join

cdp_df_resp  <- df_responses  %>% left_join(df_summary, by = c("account", "year", "country", "company"))

cdp_df_resp
## # A tibble: 52,478 x 7
##    account  year company     country   industry                 sector  response
##      <dbl> <dbl> <chr>       <chr>     <chr>                    <chr>   <chr>   
##  1      44  2010 3i Group    United K… Banks, Diverse Financia… Banks   public  
##  2      46  2010 3i Infrast… Channel … Banks, Diverse Financia… <NA>    <NA>    
##  3     285  2010 3M Company  USA       Consumer Durables, Hous… Indust… public  
##  4      30  2010 888 Holdin… Gibraltar Hotels, Restaurants & L… <NA>    <NA>    
##  5   22750  2010 A & J Muck… United K… Real Estate              <NA>    <NA>    
##  6    1535  2010 A.G. Barr … United K… Food & Beverage Process… <NA>    <NA>    
##  7     276  2010 A.P. Molle… Denmark   Water Transportation - … Indust… public  
##  8      87  2010 A2A         Italy     Electric Utilities & In… Utilit… public  
##  9   28813  2010 A4e         United K… Education Services       <NA>    <NA>    
## 10      94  2010 Aalberts I… Netherla… Electrical Equipment an… <NA>    <NA>    
## # … with 52,468 more rows
  • check for duplicate rows
cdp_df_resp  %>%  
  group_by(year, account)  %>% 
  tally(sort = TRUE)  
## # A tibble: 52,437 x 3
## # Groups:   year [8]
##     year account     n
##    <dbl>   <dbl> <int>
##  1  2010     348     3
##  2  2010    1763     2
##  3  2010    1766     2
##  4  2010    3310     2
##  5  2010    9028     2
##  6  2010    9874     2
##  7  2010   11623     2
##  8  2010   14289     2
##  9  2010   14332     2
## 10  2010   14545     2
## # … with 52,427 more rows
  • 40 accounts have duplicate observations
  • Remove duplicates
cdp_df_resp_1  <- cdp_df_resp  %>% 
  group_by(year)  %>% 
  distinct(account, .keep_all = TRUE) 

cdp_df_resp_1 
## # A tibble: 52,437 x 7
## # Groups:   year [8]
##    account  year company     country   industry                 sector  response
##      <dbl> <dbl> <chr>       <chr>     <chr>                    <chr>   <chr>   
##  1      44  2010 3i Group    United K… Banks, Diverse Financia… Banks   public  
##  2      46  2010 3i Infrast… Channel … Banks, Diverse Financia… <NA>    <NA>    
##  3     285  2010 3M Company  USA       Consumer Durables, Hous… Indust… public  
##  4      30  2010 888 Holdin… Gibraltar Hotels, Restaurants & L… <NA>    <NA>    
##  5   22750  2010 A & J Muck… United K… Real Estate              <NA>    <NA>    
##  6    1535  2010 A.G. Barr … United K… Food & Beverage Process… <NA>    <NA>    
##  7     276  2010 A.P. Molle… Denmark   Water Transportation - … Indust… public  
##  8      87  2010 A2A         Italy     Electric Utilities & In… Utilit… public  
##  9   28813  2010 A4e         United K… Education Services       <NA>    <NA>    
## 10      94  2010 Aalberts I… Netherla… Electrical Equipment an… <NA>    <NA>    
## # … with 52,427 more rows

Write to SQL database

  • name of SQL database is emissions
  • name of table in database cdp_summary
con <- DBI::dbConnect(odbc::odbc(), "emissions")
copy_to(con, cdp_df_resp_1, 'cdp_summary', 
    temporary = FALSE, overwrite = TRUE,
    indexes = list('account')
    )
Avatar
Elizabeth Stanny
Professor of Business Administration

My research interests include climate change and political disclosures.

Related