Hans Capener - Portfolio
HELLO! WELCOME TO MY PORTFOLIO!
Hans Capener
Education:
- Senior, Data Science major, Brigham Young University-Idaho
- Minoring in Mathematics and Statistics
- Expected Graduation: April 2025
Future Plans:
- Intend to pursue a master’s degree in Statistics, considering further studies for a PhD.
READ ME:
- The purpose of this website is to display the capabilities I have gained from personal projects as well as school. It is designed so you can quickly navigate to whatever interests you most using the sidebar on the left.
- All of the code used to generate each graph on this website can be made visible by clicking the small “Code” button on the right side of the screen above each visualization.
- Some visualizations have full analysis papers associated with them; click the links to see them if you are interested. They will be highlighted red like this.
Skill Showcase
Data Wrangling and Visualization
I am adept with R (tidyverse) and Python (pandas and seaborn) for data wrangling and visualization.
Statistical Analysis
I can create interpretable multiple linear regression models, as well as perform many other statistical tests.
SQL
Comfortable querying large databases.
Familiar with database design.
Machine Learning
Familiar with Neural Networks (including CNNs and RNNs). Comfortable with XGBoost, Random Forest, Decision Trees, and Gradient Boosted Models. Very comfortable with all supervised learning.
1. Data Wrangling and Visualization
Tidyverse (R)
Can I recreate professionals’ visualizations?
FiveThirtyEight
Recreating a graph from an article by FiveThirtyEight
My Recreation:
library(tidyverse)

df <- read_csv("https://github.com/ktoutloud/classslides/raw/master/math335/extra/women_film_data.csv") |>
filter(awlevel == "07") |>
mutate(school = case_when(
school_id == 108870 ~ "AFI",
school_id == 110662 ~ "UCLA",
school_id == 123961 ~ "USC",
school_id == 190150 ~ "Columbia",
school_id == 193900 ~ "NYU"
)) |>
group_by(sex, school, year) |>
summarise(
count = sum(graduates)
) |>
pivot_wider(names_from = sex, values_from = count) |>
mutate(
perc_female = female / (male + female) * 100
)
overall <- df |>
group_by(year) |>
summarise(
perc_female = sum(female) / (sum(male) + sum(female)) * 100
) |>
mutate(school = "Overall")
namelabels <- data.frame(
school = c("AFI", "UCLA", "USC", "Columbia", "NYU", "Overall"),
year = c(2012, 2011, 2011, 2014 , 2017, 2011),
perc_female = c(32, 58, 27, 57, 58, 45)
)
ggplot(df, aes(x=year, y=perc_female, group=school, color=school)) +
geom_hline(yintercept=50, linetype='dotted', size=1, show.legend=F) +
geom_line(size=0.8, show.legend=F) +
geom_line(data=overall, size=1, show.legend=F) +
geom_point(data=overall,
shape=21,
fill="#F0F0F0",
color='#232323',
size=4,
show.legend=F) +
geom_text(data=namelabels, aes(label=school), show.legend=F) +
geom_hline(yintercept=0, size=0.4) +
scale_color_manual(values = c('AFI' = "#EBAA8A",
'UCLA' = "#96C6E7",
'USC' = "#F39BBA",
'Columbia' = "#9FDBE0",
'NYU' = "#B27CB4",
'Overall' = '#232323')) +
scale_y_continuous(limits=c(0,60),
breaks = seq(0, 60, 10),
labels = c("0", "10", "20", "30", "40", "50", "60%")) +
scale_x_continuous(breaks = seq(2010, 2018, 2),
labels = c("2010", "'12", "'14", "'16", "'18")) +
labs(
title="The cinematography pipeline has plenty of women",
subtitle="Female share of graduating classes in graduate programs related to\ncinematography at top five film schools",
y="Female share of graduating class",
caption="Top five film schools according to The Hollywood Reporter's 2016 rankings"
) +
ggthemes::theme_fivethirtyeight() +
theme(
axis.title.y = element_text(face="bold"),
axis.text = element_text(color="#ADADAD"),
axis.text.y = element_text(color=c('#ADADAD', '#ADADAD', '#ADADAD', '#ADADAD', '#ADADAD', '#232323', '#ADADAD'))
)
The data used for the recreation was slightly different, hence the difference in years on the x-axis.
Our World In Data
Recreating a visualization from Our World in Data
fruit <- read_csv("DataWranglingAndVisualization/Mimic/fruit.csv")
colnames(fruit) <- c('Entity', 'Code', 'Year', 'Fruit', 'GDP', 'Continent')
continent <- fruit %>%
filter(!is.na(Code)) %>%
group_by(Entity) %>%
fill(Continent, .direction='downup') %>%
ungroup()
year <- continent %>%
filter(Year == 2020)
country5 <- year %>%
filter(Entity %in% c('Dominica',
'Dominican Republic',
'Guyana',
'Albania',
'Papua New Guinea',
'Ghana'))
ggplot(year, aes(x=GDP, y=Fruit, color=Continent)) +
geom_point(size=2, shape=1, color='gray70') +
geom_point(alpha=0.85) +
scale_x_continuous(trans='log',
breaks=c(1000, 2000, 5000, 10000, 20000, 50000, 100000),
labels=c("$1,000", "$2,000", "$5,000", "$10,000", "$20,000", "$50,000", "$100,000")) +
scale_y_continuous(limits = c(0,400),
expand = c(0,0),
breaks=seq(0, 350, 50),
labels=c('0 kg', '50 kg', '100 kg',
'150 kg', '200 kg', '250 kg',
'300 kg', '350 kg')) +
scale_color_manual(values = c('Africa' = "#9B559D",
'Asia' = "#32847E",
'Europe' = "#536A9D",
'North America' = "#D96C58",
'Oceania' = "#925026",
'South America' = "#802F39"))+
guides(color = guide_legend(override.aes = list(shape=15, alpha=1,size=3),
keyheight = 0.9,
keywidth = 0)) +
labs(title="Fruit consumption vs. GDP per capita, 2020",
subtitle="Average per capita fruit consumption, measured in kilograms per year versus
gross domestic product (GDP) per capita, measured in constant international-$",
x="GDP per capita",
y="Fruit supply per person") +
theme_classic() +
theme(
panel.grid.major = element_line(linetype='dotted', color='gray70'),
axis.line.y = element_blank(),
axis.line.x = element_line(size=0.25),
axis.ticks.y = element_blank(),
axis.ticks.x = element_blank(),
legend.title = element_blank(),
legend.justification = c(1,1),
axis.title.x = element_text(vjust=-1),
axis.title.y = element_text(vjust=4)
)
Original Graph:
Pandas, Seaborn and Altair (Python)
Pandas is used for data wrangling, while Seaborn and Altair are both Python visualization libraries.
Which airport is best?
“Delay Rating” is the proportion of flights that are delayed multiplied by the average hours of delay.
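Written as a formula (matching the quantities computed in the code below):

\[
\text{Delay Rating} = \frac{\text{total delays}}{\text{total flights}} \times \frac{\text{total hours delayed}}{\text{total delays}}
\]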
import pandas as pd
import numpy as np
import altair as alt

flights = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")
# WRANGLING - PANDAS
# Gets rid of characters and just leaves numbers
flights['num_of_delays_carrier'] = (
flights['num_of_delays_carrier'].str.replace(r'\D', '', regex=True)
)
# Replaces blank strings, -999, and "n/a" with the actual NaN value
flights = (flights
.replace(["", -999, "n/a"], np.nan)
.replace(["Febuary"], "February")
)
# Fills NaN values in num_of_delays_late_aircraft with the mean of the column
mean_late_air = flights.num_of_delays_late_aircraft.mean()
flights.num_of_delays_late_aircraft.fillna(mean_late_air, inplace=True)
# Fills NaN values with the month before them
flights.month.ffill(inplace=True)
totals = (flights
.groupby("airport_code")
.agg(
total_minutes_delayed =
("minutes_delayed_total", np.sum),
total_delays =
("num_of_delays_total", np.sum),
total_flights =
("num_of_flights_total", np.sum),
).assign(
total_hrs_delayed = lambda df: df.total_minutes_delayed / 60,
ave_hrs_delayed = lambda df: df.total_hrs_delayed / df.total_delays,
proportion_delayed = lambda df: df.total_delays / df.total_flights,
delay_rating = lambda df: df.proportion_delayed * df.ave_hrs_delayed
).sort_values('delay_rating', ascending=False)
.reset_index()
)
# VISUALIZATION - ALTAIR
best_airport = alt.Chart(totals,
title= alt.Title(
"Airports Rated by Delay",
subtitle= "The higher the rating, the worse airport")).encode(
x = alt.X('delay_rating:Q', title="Delay Rating"),
y= alt.Y('airport_code:N', title="Airport Code", sort="-x"),
color=alt.Color('airport_code:N', legend=None).scale(scheme="tealblues")
).mark_bar()
best_airport.save("Python/best_airport.html",embed_options={'width': 400})
2. Statistical Analysis
Linear Regression
When should I sell my Honda Accord?
Or, which Honda Accord should I buy? The “Buying Point” represents a Honda Accord for sale on KSL which, if bought and then sold once it reached 75,000 miles, would let you drive it for 39,761 miles and gain \(\approx 2\) cents per mile on its selling price. For more info, click here to see the full analysis.
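The \(\approx 2\) cents per mile comes straight from the buying and selling points used in the code below:

\[
\frac{\$15{,}831 - \$14{,}971}{75{,}000 - 35{,}239 \text{ miles}} = \frac{\$860}{39{,}761 \text{ miles}} \approx 2.2 \text{ cents per mile}
\]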
car_raw <- read_csv('LinearRegression/CarSellingPrice/Honda_Accord_Sales_Data.csv')
car <- car_raw |>
mutate(
Year = as.factor(Year),
generation = case_when(
Year %in% c(2013,2014,2015,2016,2017) ~ '2013-17',
TRUE ~ '2018-22'
),
Miles = Miles / 1000,
Cost = Cost / 1000
)
car.lm <- lm(log(Cost)~Miles, data=car)
b <- coef(car.lm)
mylm <- lm(Cost~Miles, data=car)
confintv <- exp(predict(car.lm, interval="confidence"))
predintv <- exp(predict(car.lm, interval="prediction"))
buy_point <- data.frame(Miles=35.239, Cost=14.971, generation='2013-17')
sell_point <- data.frame(Miles=75, Cost=15.83134, generation='2013-17')
ggplot(car, aes(x=Miles, y=Cost, color=generation)) +
geom_ribbon(aes(ymin=confintv[,2], ymax=confintv[,3]),
alpha=0.1, fill='skyblue', color='skyblue3') +
geom_ribbon(aes(ymin=predintv[,2], ymax=predintv[,3]),
alpha=0.1, fill='firebrick3', color='firebrick') +
geom_point() +
geom_text(aes(label='Selling Point, $15,831'),
x=100, y=17, color='hotpink') +
geom_text(aes(label='Buying Point, $14,971'),
x=34, y=16, color='skyblue') +
geom_segment(data=buy_point, xend=75, yend=15.83134) +
geom_point(data=buy_point, size=3, color='skyblue2', ) +
stat_function(fun=function(x) exp(b[1]+b[2]*x), aes(color='log(Cost)')) +
geom_point(data=sell_point, size=3, color='hotpink') +
theme_bw() +
labs(
title="Honda Accord Offers on KSL.com",
y="Sales Price (in thousands of $)",
x="Miles (in thousands of mi)"
)
Multiple Linear Regression
What affects a house’s price the most?
This section only includes the information about the regression model. For further insights on which features of a house have the greatest impact on its price, refer to the “Interpreting the Model” section of the complete analysis.
houses <- read.csv("LinearRegression/MultipleLR/train.csv", header=T, stringsAsFactors = T)
# NA fill "No ____"
correctNAlist <- c("Alley", "Fence", "PoolQC", "FireplaceQu",
"MiscFeature", "GarageQual",
"GarageCond","GarageFinish", "GarageType",
"BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1",
"BsmtFinType2", "MasVnrType")
for (item in correctNAlist) {
houses <- houses |>
mutate(
!!item := as.character(!!sym(item)),
!!item := replace_na(!!sym(item), paste("No_", item)),
!!item := as.factor(!!sym(item))
)
}
# NA fill 0
fill_0 <- c("LotFrontage", "MasVnrArea")
for (item in fill_0) {
houses <- houses |>
mutate(
!!item := replace_na(!!sym(item), 0)
)
}
houses <- houses |>
mutate(
Electrical = replace_na(Electrical, "SBrkr"),
GarageYrBlt = ifelse(GarageQual == "No_ GarageQual", 0, GarageYrBlt),
TotalSF = TotalBsmtSF + X1stFlrSF + X2ndFlrSF,
TotalSF = ifelse(TotalSF > 6000, mean(TotalSF), TotalSF),
PercBsmtFin = (TotalBsmtSF - BsmtUnfSF) / BsmtUnfSF,
PercBsmtFin = ifelse(is.na(PercBsmtFin) |
PercBsmtFin=="Inf", 0, PercBsmtFin),
Has2nd = ifelse(X2ndFlrSF == 0, 1, 0),
Has2nd = as.factor(Has2nd),
HasBsmt = ifelse(TotalBsmtSF == 0, 1, 0),
HasBsmt = as.factor(HasBsmt),
BsmtExcellent = ifelse(BsmtQual == "Ex", 1, 0),
GarageCar3 = ifelse(GarageCars == 3, 1, 0),
#### BOOLEAN ####
# 2nd Floor
X2ndFlr = ifelse(X2ndFlrSF > 0, 1, 0),
RichNeigh = case_when(
Neighborhood %in% c("StoneBr","NridgHt","NoRidge") ~ 18.6342,
Neighborhood %in% c("Blmngtn", "ClearCr", "CollgCr",
"Crawfor", "Gilbert", "NWAmes",
"SawyerW", "Somerst", "Timber", "Veenker") ~ 13.2532,
T ~ 7.3327
),
KitchenQ = case_when(
KitchenQual == "Ex" ~ 3.28555,
KitchenQual == "Gd" ~ -1.16439,
KitchenQual == "TA" ~ -1.88592,
KitchenQual == "Fa" ~ -2.22990,
),
OverallQ = case_when(
OverallQual == 1 ~ 0.4709,
OverallQual == 2 ~ -5.5332,
OverallQual == 3 ~ 18.5539,
OverallQual == 4 ~ 28.4397,
OverallQual == 5 ~ 34.9930,
OverallQual == 6 ~ 43.1099,
OverallQual == 7 ~ 53.9437,
OverallQual == 8 ~ 65.5431,
OverallQual == 9 ~ 82.3895,
OverallQual == 10 ~ 93.5527
),
customX = 42.872*(TotalSF) +
9658.661*(KitchenQ) +
1759.385*(OverallQ) +
4113.104*(RichNeigh),
megaSwitch = ifelse(MSZoning %in% c("RL", "FV", "RH"), 1, 0)
)
set.seed(122)
num_rows <- 1000 #1460 total
keep <- sample(1:nrow(houses), num_rows)
train <- houses[keep, ] #Use this in the lm(..., data=mytrain)
test <- houses[-keep, ] #Use this in the predict(..., newdata=mytest)
final.lm <- lm(SalePrice~customX + customX:GarageCar3 + GarageCar3,data=train)
b <- coef(final.lm)
ggplot(train, aes(x=customX, y=SalePrice, color=as.factor(GarageCar3))) +
geom_point(alpha=0.3) +
scale_y_continuous(expand=c(0,0), limits = c(0, 630000), labels=c("$0", "$200k", "$400k", "$600k")) +
scale_x_continuous(labels=c("0", "100k", "200k", "300k", "400k", "500k")) +
scale_color_manual(values = c("skyblue", "firebrick"), labels = c("No", "Yes")) +
stat_function(fun=function(x) b[1] + b[2]*x, color="skyblue3") +
stat_function(fun=function(x) (b[1] + b[3]) + (b[2] + b[4])*x, color="firebrick3") +
labs(title="Can you Predict a House's Sale Price",
subtitle="Just by knowing things about the house?",
x="Custom X Variable",
y="Sale Price ($)",
color="Does the house have \n a 3 car garage?") +
theme_classic() +
theme(
panel.grid.major = element_line(color='gray95', linetype='dashed'),
panel.grid.minor = element_line(color='gray95', linetype='dashed'),
axis.line.y = element_line(color='gray90', linetype='dashed'),
axis.ticks.y = element_blank()
)
Here is the summary of the multiple linear regression:
Term | Estimate | Std. Error | t value | Pr(>|t|)
---|---|---|---|---
(Intercept) | -15975 | 3923 | -4.072 | 5.03e-05
customX | 0.8676 | 0.01851 | 46.87 | 3.57e-254
GarageCar3 | -84158 | 14575 | -5.774 | 1.034e-08
customX:GarageCar3 | 0.3429 | 0.04518 | 7.591 | 7.303e-14
Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
1000 | 29537 | 0.8463 | 0.8458 |
After running the model on a validation set of data, the Adjusted R-squared only dropped by 0.04, showing that this model does work well at predicting new data. It should be noted that this model can predict within $60k of the actual sales price 95% of the time.
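The $60k figure is roughly two residual standard errors from the table above (a loose rule of thumb, assuming approximately normal residuals):

\[
2 \times \$29{,}537 \approx \$59{,}000
\]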
y <- predict(final.lm, newdata=test)
ybar <- mean(test$SalePrice)
SSTO <- sum( (test$SalePrice - ybar)^2 )
# Compute SSE for each model using SalePrice - yhat
SSE <- sum( (test$SalePrice - y)^2 )
# Compute R-squared for each
rs <- 1 - SSE/SSTO
n <- length(test$SalePrice) # sample size
p <- length(coef(final.lm))
rsa <- 1 - (n-1)/(n-p)*SSE/SSTO
my_output_table2 <- data.frame(Model = c("MyLM"), `Original R2` = c(summary(final.lm)$r.squared), `Orig. Adj. R-squared` = c(summary(final.lm)$adj.r.squared), `Validation R-squared` = c(rs), `Validation Adj. R^2` = c(rsa))
colnames(my_output_table2) <- c("Model", "Original $R^2$", "Original Adj. $R^2$", "Validation $R^2$", "Validation Adj. $R^2$")
knitr::kable(my_output_table2, escape=TRUE, digits=4)
Model | Original \(R^2\) | Original Adj. \(R^2\) | Validation \(R^2\) | Validation Adj. \(R^2\) |
---|---|---|---|---|
MyLM | 0.8463 | 0.8458 | 0.804 | 0.8027 |
Other Statistical Methods
T Test
Do Men play video games more than Women?
HSS <- read_csv("Analyses/TTest/HighSchoolSeniors.csv")
hist <- ggplot(HSS) +
geom_histogram(
aes(x=Video_Games_Hours,
fill= Gender,
color= Gender),
alpha=0.2,
position='identity'
) +
theme_light() +
labs(x="Average Hours of Video Games Played Per Week",
y="Count")
box <- HSS |>
drop_na() |>
ggplot(aes(y=Video_Games_Hours, x=Gender)) +
geom_boxplot(aes(fill=Gender), color=rgb(0.2,0.2,0.2,0.7)) +
theme_light() +
labs(x="Gender",
y="Average Hours of Video Games Played Per Week")
gridExtra::grid.arrange(hist, box, ncol=2, top = grid::textGrob("Male vs. Female Video Game Usage"))
Given our significant p-value, we can confidently state that men play significantly more video games than women.
ANOVA
Does Music and/or Laptop Position affect Typing Speed?
For more details, view analysis
typeSpeed <- read_csv("Analyses/ANOVA/type_data.csv")
stance <- ggplot(typeSpeed, aes(x= Stance, y= WPM)) +
geom_boxplot(fill='lightgray', notch=TRUE) +
geom_dotplot(aes(fill= Music), binaxis= 'y', stackdir= 'up',
position=position_dodge(0.2),
show.legend=F) +
theme_light()
music <- ggplot(typeSpeed, aes(x= Music, y= WPM, fill= Music)) +
geom_boxplot(fill='lightgray', notch=TRUE) +
geom_dotplot(binaxis= 'y', stackdir= 'up',
position=position_dodge(0.2)) +
theme_light() +
theme(legend.position=c(0.75,0.85))
gridExtra::grid.arrange(stance, music,
ncol=2,
top=grid::textGrob("Typing Speed Based on Laptop Position and Music"))
ggplot(typeSpeed, aes(x=Music, y=WPM, group=Stance, color=Stance)) +
geom_point(color='firebrick') +
stat_summary(fun='mean', geom='line') +
labs(title="Typing Speed Based on Non-Vocal(NV), Vocal(V), and No Music(None)",
y="WPM",
x="Type of Music") +
theme_light()
typeSpeed.aov <- aov(WPM ~ Music + Stance + Music:Stance, data=typeSpeed)
summary(typeSpeed.aov) %>%
pander()
Source | Df | Sum Sq | Mean Sq | F value | Pr(>F)
---|---|---|---|---|---
Music | 2 | 128.1 | 64.06 | 0.9093 | 0.4288
Stance | 1 | 64.22 | 64.22 | 0.9117 | 0.3585
Music:Stance | 2 | 18.78 | 9.389 | 0.1333 | 0.8765
Residuals | 12 | 845.3 | 70.44 | NA | NA
Due to the insignificant results, we fail to reject our null hypothesis that music and laptop position have no effect on typing speed. However, it is interesting that a new typing speed record of 112 WPM was set while listening to non-vocal music and sitting on the floor with my laptop in my lap.
Wilcoxon Test
What learning method leads to the best outcomes?
Three learning methods were tested in a recalling-words experiment. The following graphic shows the distribution of the test results from each group.
For more results / interpretation: see the full analysis
ggplot(Friendly, aes(
x= condition,
y= correct,
fill= condition)) +
geom_boxplot(fill= c('cyan3', 'cyan', 'gray'), color= 'black',
notch=TRUE, width= 0.4) +
geom_dotplot(binaxis= 'y', position= 'dodge', stackdir='center',
binwidth = 0.5) +
scale_fill_manual(values=c("cyan", "cyan3", "lightgray")) +
labs(title="Recalling Words Test",
x= "Testing Methods",
y= "Final Score") +
theme_light() +
theme(legend.position= 'none')
wilcox.test(Friendly$correct[Friendly$condition == 'Meshed'], Friendly$correct[Friendly$condition == 'Before']) %>%
pander(caption="Wilcoxon rank some test with continuity correction (there are ties in the data)")
Test statistic | P value | Alternative hypothesis |
---|---|---|
38 | 0.378 | two.sided |
With a p-value greater than our significance level of \(\alpha=0.05\), we fail to reject our null hypothesis and continue to assume the “Before” and “Meshed” learning methods do not significantly differ. However, looking closer at the box plots, the notches (approximate confidence ranges for the medians) of the “Before” and “Meshed” boxplots fall outside that of the “SFR” method, telling us that those two methods do differ from “SFR”.
Conclusion: The “Before” or “Meshed” learning approaches should be used over the “SFR” method.
Logistic Regression
Can you predict happiness?
Using data from the GSS (General Social Survey), I created a model which predicts whether or not someone is happy based on a Happy Rating, which I derived from 11 of the survey questions involving physical and mental health, social relations, education, and religion. In the analysis, I also ran a Random Forest ML model that predicts one’s happiness from their survey answers with 83% accuracy.
library(haven)  # for read_sas()

gss2021 <- read_sas("Analyses/LogisticRegression/gss2021.sas7bdat", NULL)
# Filter Data and Create Happy_Rating
mygss <- gss2021 %>%
select(c('HAPPY', 'HAPMAR','LIFE', 'SATSOC', 'HLTHMNTL', 'ACTSSOC', 'HEALTH', 'EMOPROBS', 'ATTEND', 'EDUC', 'RELACTIV', 'HLTHPHYS')) %>%
filter(!is.na(HAPPY)) %>% # Drop NA rows in the "HAPPY" column
mutate( # Merge "Pretty Happy" and "Very Happy" columns to just be "Happy" ~ 1
HAPPY = case_when(
HAPPY %in% c(1,2) ~ 1,
HAPPY %in% 3 ~ 0
)
) %>%
mutate_all(~ifelse(is.na(.), mean(., na.rm = TRUE), .)) %>% # Fill NA with mean()
mutate_all(~ . / max(.)) %>% # Divide each column by its max value
mutate( # Invert scale on questions that had inverse numbered answers
ATTEND = ATTEND * -1 + 1,
RELACTIV = RELACTIV * -1 + 1,
EDUC = EDUC * -1 + 1,
) %>%
mutate( # Scale each factor to match its importance in the ML
Happy_Rating = -100 * (EDUC * 0.15426218 +
ATTEND * 0.12488643 +
HAPMAR * 0.11542531 +
RELACTIV * 0.11377695 +
SATSOC * 0.08842675 +
EMOPROBS * 0.08201357 +
LIFE * 0.07298428 +
ACTSSOC * 0.06718067 +
HLTHMNTL * 0.06597608 +
HLTHPHYS * 0.06001798 +
HEALTH * 0.05504979) + 100,
Happy_Rating_Unscaled = -10 * (EDUC +
ATTEND + HAPMAR +
RELACTIV + SATSOC +
EMOPROBS + LIFE +
ACTSSOC + HLTHMNTL +
HLTHPHYS + HEALTH) + 100
)
# Run Logistic Regression
gss.glm <- glm(HAPPY ~ Happy_Rating, data = mygss, family=binomial)
# Graph Logistic Regression
palette(c(rgb(1,0,0, alpha =0.05),rgb(0,0,1, alpha =0.05)))
plot(HAPPY ~ Happy_Rating, data = mygss, col=as.factor(HAPPY), pch=16, cex=2, xlab = "Happy Rating", ylab = "Are you Happy?", main = "Can You Predict Whether or not Someone is Happy?")
b <- coef(gss.glm)
curve(exp(b[1] +b[2]*x)/(1+exp(b[1] +b[2]*x)), add=TRUE)
legend(x=65, y=0.65, legend = c("Yes", "No"), col = c('blue', 'red'), pch = 16, title = "Are you Happy?")
3. SQL
Querying
Of all baseball players with at least 100 career at-bats, who had the highest career batting average?
Note: this table was generated in a Python file in Visual Studio and then copied over, due to the limitations of running Python code in an Rmd file alongside other R code. However, the code used can still be seen by clicking “Show”.
import sqlite3
import pandas as pd
try:
con = sqlite3.connect('lahmansbaseballdb.sqlite')
batting_avg_career = pd.read_sql_query(
"""
SELECT playerID, SUM(H), SUM(AB), 1.0*SUM(H) / SUM(AB) as Career_Batting_Avg
FROM Batting
GROUP BY playerID
HAVING SUM(AB) >= 100
ORDER BY Career_Batting_Avg DESC, playerID
LIMIT 5
""", con)
except:
pass
I am comfortable with SQL querying.
Database Design
I am familiar with MySQL and database design. I have not yet deployed a design to a server and fully implemented it, but I do understand the principles of good database design; below is a database I designed to replicate what nitrotype, an online type-racing game, might use.
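A minimal sketch of the core tables such a design might include (table and column names here are my own illustrative assumptions, written as SQLite DDL through Python rather than MySQL):

import sqlite3

# Illustrative only: a simplified player / race / race_result design.
con = sqlite3.connect(":memory:")  # in-memory database for the example
con.executescript("""
CREATE TABLE player (
    player_id  INTEGER PRIMARY KEY,
    username   TEXT NOT NULL UNIQUE,
    joined_on  TEXT
);

CREATE TABLE race (
    race_id    INTEGER PRIMARY KEY,
    track_text TEXT NOT NULL,
    raced_on   TEXT
);

-- One row per player per race resolves the many-to-many relationship
CREATE TABLE race_result (
    race_id    INTEGER NOT NULL REFERENCES race(race_id),
    player_id  INTEGER NOT NULL REFERENCES player(player_id),
    wpm        REAL,
    accuracy   REAL,
    place      INTEGER,
    PRIMARY KEY (race_id, player_id)
);
""")
con.close()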
4. Machine Learning
All of my Machine Learning experience has been in Python using the TensorFlow Keras library. I am comfortable with:
- Decision Trees
- Random Forest
- Gradient Boost
- XGBoost
- Neural Networks
- Typical “Feed-Forward” Network
- Convolutional Neural Network (CNN)
- Image Classification
- Recurrent Neural Network (RNN)
- Text Generation
Here are some examples of my work
XGBoost
NN
1-hour Coding Challenge: Create an ML model to classify rice species:
Within the hour, I was able to:
- Create some basic data exploration visualizations
- Compare and contrast an XGBoost model and Neural Network on the data.
- Refine the Neural Network to an F1-score of 0.944.
My model was within the top 5 of the class.
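The original notebook isn’t reproduced on this page, but here is a minimal sketch of the kind of XGBoost-versus-neural-network comparison described above (the file name, column names, and hyperparameters are assumptions, not the actual challenge code):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
from xgboost import XGBClassifier
from tensorflow import keras

# Hypothetical file and column names -- the actual challenge data may differ
rice = pd.read_csv("rice.csv")
X = rice.drop(columns="Class")
y = (rice["Class"] == rice["Class"].unique()[0]).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: gradient-boosted trees
xgb = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
xgb.fit(X_train, y_train)
print("XGBoost F1:", f1_score(y_test, xgb.predict(X_test)))

# Simple feed-forward neural network on standardized features
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
nn = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1],)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
nn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
nn.fit(X_train_s, y_train, epochs=30, batch_size=32, verbose=0)
print("NN F1:", f1_score(y_test, (nn.predict(X_test_s).ravel() > 0.5).astype(int)))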
CNN - Image Recognition
Trained a CNN to classify 43 different types of street signs in Germany. The model achieved 96% accuracy on a holdout set of images never before seen by the model.
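The trained model itself isn’t shown on this page; below is a minimal Keras sketch of a CNN for a 43-class sign classifier, assuming 32×32 RGB inputs (the real architecture and image size may differ):

from tensorflow import keras
from tensorflow.keras import layers

# Illustrative architecture only -- not the actual trained model's configuration
model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Rescaling(1.0 / 255),              # scale pixel values to [0, 1]
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(43, activation="softmax"),   # one output per street-sign class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_data=(val_images, val_labels), epochs=10)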