Hello, and thanks for visiting my webpage. I am a data scientist with 10+ years of experience in quantitative and data analytics.

I have significant experience across the full machine learning lifecycle: data gathering and analysis, modeling, and production.

On this page you'll find an overview of my expertise, data science projects, the tools I used and links to the underlying code and data.

Data Science Expertise

  • Machine Learning

    Experience with a wide variety of machine learning techniques, including classification, clustering, regression (linear and logistic), neural networks, tree-based and ensemble methods, and support vector machines.

  • Natural Language Processing

    Sentiment analysis, language models, topic modeling and deep learning.

Technical Expertise

  • Python data science stack

    Expertise in the full data science stack, including pandas, scikit-learn, TensorFlow, and Keras.

  • Cloud platforms: AWS, GCP

    Experience with Amazon Web Services and Google Cloud Platform.

  • Big data analytics

    Worked with Hadoop, Spark, and PySpark in large organizations to implement modeling and big data solutions.

Data Science Projects

Below are some of the projects I've worked on, covering a range of subjects, data analysis tools, and machine learning techniques. Each includes links to the GitHub repo with the Jupyter notebook and to the original data sources.

Text summarization

Using a corpus of over 1 million news articles, we developed a TensorFlow-based model to summarize news headlines. The model used cutting-edge Python libraries for natural language processing and AI.

Tools: Python, TensorFlow, Keras.

Bike share SF

In this project I used Google Cloud Platform to analyze usage of Ford GoBike bikes in San Francisco. The main focus was overall bike usage and how heavily the bikes were used during commuter hours. Visualizations were done in Matplotlib.

Tools: Python, BigQuery, Matplotlib, pandas.
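
As a rough illustration of the commuter-hours question, the sketch below computes the share of trips that start during weekday commute windows. It is not the project code: the commute windows, the function names, and the sample trips are all assumptions made up for this example, and the real analysis ran in BigQuery rather than plain Python.

```python
from collections import Counter
from datetime import datetime

# Assumed commute windows (not from the project): weekdays,
# trips starting 07:00-09:59 or 16:00-18:59.
COMMUTE_HOURS = set(range(7, 10)) | set(range(16, 19))


def is_commute_trip(start: datetime) -> bool:
    """True if the trip starts on a weekday inside a commute window."""
    return start.weekday() < 5 and start.hour in COMMUTE_HOURS


def commute_share(start_times) -> float:
    """Fraction of trips that start during commute hours."""
    starts = list(start_times)
    if not starts:
        return 0.0
    return sum(is_commute_trip(s) for s in starts) / len(starts)


def trips_by_hour(start_times) -> Counter:
    """Trip counts keyed by hour of day, e.g. for a bar chart."""
    return Counter(s.hour for s in start_times)


# Made-up sample trips, for illustration only.
sample_trips = [
    datetime(2018, 3, 5, 8, 15),   # Monday morning
    datetime(2018, 3, 5, 17, 40),  # Monday evening
    datetime(2018, 3, 6, 13, 5),   # Tuesday midday
    datetime(2018, 3, 10, 8, 30),  # Saturday morning (not a commute trip)
]

print(commute_share(sample_trips))
print(trips_by_hour(sample_trips))
```

The hourly counts from `trips_by_hour` are the kind of series that could be passed to a Matplotlib bar chart.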

House price prediction

In this project I used gradient boosting machines to predict house prices on a Kaggle dataset. I explored three techniques: linear regression (as a baseline), random forests, and gradient boosting machines. The notebook with the analysis and results is available on GitHub.

Tools: Python, Matplotlib, scikit-learn, pandas.
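
A three-way comparison like the one described above might be set up roughly as follows. This is a minimal sketch, not the notebook's code: it uses synthetic data from scikit-learn's `make_regression` in place of the Kaggle dataset, and the hyperparameters, fold count, and RMSE metric are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the house price data (assumption).
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)

# The three techniques named in the project; settings are illustrative.
models = {
    "linear regression (baseline)": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}


def compare_models(models, X, y, folds=5):
    """Return the mean cross-validated RMSE for each model."""
    results = {}
    for name, model in models.items():
        scores = cross_val_score(
            model, X, y, cv=folds, scoring="neg_root_mean_squared_error"
        )
        # scikit-learn reports negated errors, so flip the sign back.
        results[name] = -scores.mean()
    return results


rmse_by_model = compare_models(models, X, y)
for name, rmse in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:.1f}")
```

Because the synthetic target is linear by construction, the ranking printed here says nothing about how the models compared on the real data.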

Tax in the US

Using publicly available tax revenue data at the zip code level, we explored the distribution of income tax. The analysis looks at which zip codes pay the highest and lowest income tax, which states generate the most tax revenue, and which contribute the highest per capita tax revenue.

Tools: Python, Matplotlib, pandas, ipywidgets/d3.js.
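
The per capita comparison could be sketched as a roll-up from zip-level records to states. The record layout below (state, tax paid, population), the function name, and the figures are hypothetical; the actual dataset's columns may differ, and the project itself used pandas.

```python
from collections import defaultdict


def per_capita_tax_by_state(records):
    """Aggregate zip-level (state, tax, population) rows to tax per capita by state."""
    totals = defaultdict(lambda: [0.0, 0])  # state -> [total tax, total population]
    for state, tax, population in records:
        totals[state][0] += tax
        totals[state][1] += population
    # Skip states with no recorded population to avoid dividing by zero.
    return {
        state: tax / population
        for state, (tax, population) in totals.items()
        if population > 0
    }


# Made-up zip-level rows with placeholder state codes, for illustration only.
sample_records = [
    ("AA", 1000.0, 10),
    ("AA", 500.0, 5),
    ("BB", 900.0, 30),
]

per_capita = per_capita_tax_by_state(sample_records)
highest = max(per_capita, key=per_capita.get)
print(per_capita, highest)
```

Summing tax and population before dividing weights each zip code by its size, which differs from averaging the zip-level per capita figures.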

Forest cover type prediction

Using data from Kaggle, this project looks at how to predict the forest cover type (the dominant type of tree) from topographic attributes of the region. We explored several classifiers, including logistic regression, multi-layer neural networks, support vector machines, random forests, and gradient-boosted trees.

Tools: Python, Matplotlib, scikit-learn, pandas.