---
title: "R Demo for the 2018 BHI & BSN Data Challenge"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

# R Demo for the 2018 BHI & BSN Data Challenge

The toughest part is often getting the database running and the software connecting to it. This is the way I usually do it:

    run echo 'password' > .pwfile_ in a command prompt, where password is replaced with your user's postgres password
    Run this script in the same directory as this file

The resulting chunk may install some R packages on your computer.


```{r}
if(!("dplyr" %in% installed.packages()) |!("dbplyr" %in% installed.packages()) | !("RPostgreSQL" %in% installed.packages()))  { 
    install.packages(c("dplyr","dbplyr","RPostgreSQL"))
}

                     
library(dplyr); library(dbplyr); library(RPostgreSQL); library(DBI)

# Create a database connection
user <- 'postgres'
host <- 'localhost'
dbname <- 'mimic'
schema <- 'mimiciii,public'
port <- 5434
passwd <- readLines("~/.pwfile_")

pg_src <- src_postgres(dbname = dbname, 
                       host = host, port = port, 
                       user = user, password = passwd,
                       options=paste0("-c search_path=", schema))

m <- dbDriver("PostgreSQL")
con <- dbConnect(m, user=user, password=passwd, dbname=dbname,host=host,port=port) # The trickiest parts
dbSendStatement(con,paste0("SET search_path TO ",schema))

```

```{r}

# Run query and assign the results to a DataFrame
# Requires the icustay_detail view from:
# https://github.com/MIT-LCP/mimic-code/tree/master/concepts/demographics


# To get these views, the following may work (uncomment to run):
## NOTE:
## This will overwrite old views on your machine if you had already created them.
## If you have them already, you likely do not need to run this code!.
## This will take a while, possibly several hours, depending on the speed of your computer/network
#
#
## if(!("MIMICutil" %in% installed.packages())) { devtools::install_github("jraffa/MIMICutil") }
## 
## library(MIMICutil); library(RPostgreSQL)
## v <- get_views(URLlist="https://raw.githubusercontent.com/jraffa/MIMICutil/master/mat_view_urls_working",dplyrDB=pg_src,con=con)

###


query = "
WITH first_icu AS (
    SELECT i.subject_id, i.hadm_id, i.icustay_id, i.gender, i.admittime admittime_hospital, 
      i.dischtime dischtime_hospital, i.los_hospital, i.age, i.admission_type, 
      i.hospital_expire_flag, i.intime intime_icu, i.outtime outtime_icu, i.los_icu, 
      s.first_careunit
    FROM icustay_detail i
    LEFT JOIN icustays s
    ON i.icustay_id = s.icustay_id
    WHERE i.hospstay_seq = 1
      AND i.icustay_seq = 1
      AND i.age >= 16
)
SELECT f.*, o.icustay_expire_flag, o.oasis, o.oasis_prob
FROM first_icu f
LEFT JOIN oasis o
ON f.icustay_id = o.icustay_id;
"

rs <- dbSendQuery(con,query)
dat <- dbFetch(rs)

# or using dplyr:
# uncomment below
#
## icustay_detail_tbl <- tbl(pg_src,"icustay_detail")
## icustays_tbl <- tbl(pg_src,"icustays")
## oasis_tbl  <- tbl(pg_src,"oasis")
##
## first_icu_tbl <- icustay_detail_tbl %>%  select(subject_id,hadm_id, icustay_id, gender,admittime_hospital=admittime,  
##      dischtime_hospital=dischtime, los_hospital, admission_type, 
##   hospital_expire_flag,intime_icu=intime,  outtime_icu=outtime, los_icu,age,hospstay_seq,icustay_seq) %>% 
## left_join(icustays_tbl %>% select(icustay_id, first_careunit) ,by="icustay_id") %>% 
##                filter(hospstay_seq==1,icustay_seq ==1,age>=16) %>% select(-hospstay_seq,-icustay_seq)

## dat <- first_icu_tbl %>% 
##    left_join(oasis_tbl %>% select(icustay_id,icustay_expire_flag, oasis, oasis_prob),by="icustay_id") %>% 
##    collect(n=Inf)


```


# Check Data Extracted

Always a good idea to inspect the data after you have extracted it. We will look at the first six patients (rows), and then check the number of rows, and get some summary statistics of the dataset.


```{r}

# Have a look at the dataset:
head(dat)

nrow(dat) #38557

summary(dat)


```

# Add day of week to DataFrame

If we are going to examine the weekend effect, we need to pull this out of the dataset, as you can see, all we have above are dates. We will define a weekend, as anytime between Saturday (00:00:00) until Sunday (23:59:59). The dates above are shifted, and that's why they look odd, but they are matched on the day of week, so this aspect is preserved.


```{r}
if(!("lubridate" %in% installed.packages())) { install.packages("lubridate")}

library(lubridate)

dat$dow <- as.factor(wday(dat$intime_icu ))
table(dat$dow)
# Convert to text levels
levels(dat$dow) <- c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
table(dat$dow)

dat$weekend <- dat$dow %in% c("Sunday","Saturday")
table(dat$weekend)

```


# Produce some Summary Statistics by DOW and Weekday vs. Weekend

Next, it's good to look at some basic summaries of the data. We will compute simple averages and percentages/counts for each of the variables we have extracted, and look at it by day of week (dow) and weekend (weekend).

```{r}


#If this fails, you may have to install pkg-config and libnlopt-dev on your system
# e.g., in Ubuntu/Debian:
# sudo apt-get install pkg-config  libnlopt-dev

if(!("tableone" %in% installed.packages())) { install.packages("tableone")}

library(tableone)

vars <- c('gender', 'los_hospital', 'age', 'admission_type', 'hospital_expire_flag', 
           'los_icu','icustay_expire_flag', 'oasis', 'oasis_prob', 'first_careunit')

group_var <- "dow"

CreateTableOne(dat,vars=vars,strata=group_var,factorVars=c("hospital_expire_flag","icustay_expire_flag")) %>% 
      print(
  printToggle      = FALSE,
  showAllLevels    = TRUE,
  cramVars         = "kon"
) %>% 
{data.frame(
  variable_name             = gsub(" ", " ", rownames(.), fixed = TRUE), ., 
  row.names        = NULL, 
  check.names      = FALSE, 
  stringsAsFactors = FALSE)} 


```


```{r}
vars <- c('gender', 'los_hospital', 'age', 'admission_type', 'hospital_expire_flag', 
           'los_icu','icustay_expire_flag', 'oasis', 'oasis_prob', 'first_careunit')

group_var <- "weekend"

CreateTableOne(dat,vars=vars,strata=group_var,factorVars=c("hospital_expire_flag","icustay_expire_flag")) %>% 
      print(
  printToggle      = FALSE,
  showAllLevels    = TRUE,
  cramVars         = "kon"
) %>% 
{data.frame(
  variable_name             = gsub(" ", " ", rownames(.), fixed = TRUE), ., 
  row.names        = NULL, 
  check.names      = FALSE, 
  stringsAsFactors = FALSE)} 

```


It looks like there's a higher rate of hospital mortality (14.0% vs 10.8%) and ICU mortality (10.2% vs 7.8%) on weekends when compared to weekdays. There are also statistically significant differences between several other important variables, including: admission type, disease severity (OASIS), and the patient's first care unit, suggesting that these groups may be fundamentally different in some way. Let's explore this a little further.


```{r}
library(ggplot2)

dat %>% group_by(admission_type,dow) %>% 
    summarise(n=n(),mortRate=mean(hospital_expire_flag==1),
              LL95 = mortRate - qnorm(0.975)*sqrt(mortRate*(1-mortRate)/n),
             UL95 = mortRate + qnorm(0.975)*sqrt(mortRate*(1-mortRate)/n)) %>%
    ggplot(aes(dow,mortRate,group=admission_type,col=admission_type)) + 
    geom_line() + 
    geom_ribbon(aes(ymax=UL95,ymin=LL95,fill=admission_type,col=NULL),alpha=0.1) +
    xlab("Day of Week") + ylab("Hospital Mortality Rate")


```


```{r}

dat$weekend <- as.factor(dat$weekend)
dat$admission_type <- as.factor(dat$admission_type)
dat %>% group_by(admission_type,weekend) %>% 
    summarise(n=n(),mortRate=mean(hospital_expire_flag==1),
              LL95 = mortRate - qnorm(0.975)*sqrt(mortRate*(1-mortRate)/n),
             UL95 = mortRate + qnorm(0.975)*sqrt(mortRate*(1-mortRate)/n)) %>%
    ggplot(aes(weekend,mortRate,group=admission_type,col=admission_type)) + 
    geom_line() + 
    geom_ribbon(aes(ymax=UL95,ymin=LL95,fill=admission_type,col=NULL),alpha=0.1) +
    xlab("Weekend") + ylab("Hospital Mortality Rate")
```


# Model Building

Let's try to incorporate what we saw above into a very simple model. We will use logistic regression with hospital mortality as our outcome. First a unadjusted estimate, and then we will try to adjust for admission type.


```{r}

simple.glm <- glm(hospital_expire_flag ~ weekend,data=dat,family="binomial")
summary(simple.glm)

# Uncomment for pretty graph
## if(!("sjPlot" %in% installed.packages())) { install.packages("sjPlot")}
 #library(sjPlot)
 #sjp.glm(simple.glm)
```
 
 
So, looking at the crude rates, and odds ratios, we can see that patients admitted on a weekend have about a 35% increase in the odds of dying in the hospital when compared to those on a weekday. This effect is statistically significant (p<0.001).

Are we done?

I hope not. We saw from the plots above, there is likely some confounding and effect modification going on. Continuing on....


```{r}

# Without effect modification
adj.glm <- glm(hospital_expire_flag ~ weekend + admission_type,data=dat,family="binomial")
summary(adj.glm)
drop1(adj.glm,test="Chisq")

# With effect modification (interaction)
adj.int.glm <- glm(hospital_expire_flag ~ weekend*admission_type,data=dat,family="binomial")
summary(adj.int.glm)
drop1(adj.int.glm,test="Chisq")


```


```{r}

onWeekend <- expand.grid(weekend="TRUE",admission_type=levels(dat$admission_type))
onWeekday <- expand.grid(weekend="FALSE",admission_type=levels(dat$admission_type))

onWeekend$pred <- predict(adj.int.glm,newdata=onWeekend)
onWeekend 


onWeekday$pred <- predict(adj.int.glm,newdata=onWeekday)
onWeekday 

```


```{r}
onWeekend %>% inner_join(onWeekday,by="admission_type") %>% mutate(OR=exp((pred.x-pred.y)))
```

So, this mirrors what we saw above. While there may be differences between EMERGENCY and URGENT admission types, an ELECTIVE admission occurring on a weekend has an odds of mortality almost four times that of an ELECTIVE admission on a weekday.

What do you think? Do patients admitted on a weekend have a higher rate of mortality than those admitted during the week? Who is most effected, if at all?

Looking forward to see what you guys come up with!