Working across programs.
One of the best things I took away from my EPA Air Quality Data project was that completing it required multiple data sources and several different analysis environments. That forced me to keep hold of the thread of what I was trying to find and how I was structuring my work, which turned out to be a really effective discipline.
First, during EDA, I wrote a script to hit the EPA's Air Quality System (AQS) API and explore the different metrics available to me: which gases, what geographic coordinate system, measurement frequency, and so on.
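A stripped-down sketch of that kind of exploration script looks something like this; the endpoint path and the email/key parameters are illustrative rather than a verbatim copy of what I ran, so check the AQS API documentation for the exact calls:

```python
import requests

# Sketch of the exploration step. The aqs.epa.gov path and the email/key
# credential parameters are assumptions here, not verbatim from my script.
BASE = "https://aqs.epa.gov/data/api"
CREDS = {"email": "you@example.com", "key": "your-api-key"}  # placeholder credentials

# List the pollutant parameters available under the "CRITERIA" class
resp = requests.get(
    f"{BASE}/list/parametersByClass",
    params={**CREDS, "pc": "CRITERIA"},
    timeout=30,
)
resp.raise_for_status()

# Print a few raw records to see what codes, metrics, and metadata come back
for record in resp.json().get("Data", [])[:10]:
    print(record)
```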
As I honed the project, I worked in a Jupyter Notebook to do the heavy lifting of data restructuring, pickling out my data objects and fitted models to pass between working environments.
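The pattern is only a few lines; the file names and columns below are placeholders rather than my actual objects:

```python
import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-ins for the real objects in my notebook:
# a cleaned daily-AQI table and a fitted classifier.
aqi_df = pd.read_csv("daily_aqi_by_county.csv")              # assumed filename
model = RandomForestClassifier(n_estimators=200)
model.fit(aqi_df[["pm25", "ozone"]], aqi_df["aqi_category"])  # assumed columns

# Pickle both out so another notebook or script can pick up where this left off
aqi_df.to_pickle("aqi_df.pkl")
with open("rf_model.pkl", "wb") as fh:
    pickle.dump(model, fh)

# ...and in the next working environment, read them straight back in
aqi_df = pd.read_pickle("aqi_df.pkl")
with open("rf_model.pkl", "rb") as fh:
    model = pickle.load(fh)
```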
I also gleaned a lot of insight from preliminary analysis in Tableau, specifically for geospatial aggregations. Choropleths and bubble maps can absolutely obscure some of the fine-grained insights available to you in a full GIS (I use QGIS), but for getting a sense of your data early in the analysis, I cannot recommend them highly enough.
For example, the map below gives me an instant sense of where the factories whose toxic release data I'm pulling are clustered, especially with the terrestrial overlay and major city labels. At a glance, I can see a high concentration of industrial pollutants in the greater Los Angeles and San Francisco areas, and a low concentration but high total releases in less populous areas.
I particularly like that Tableau lets me make the bubbles semi-transparent, so I can get a clearer sense of when there is a cluster among smaller counties. As with anything geospatial, it pays dividends to keep in mind that administrative boundaries (here, counties) are not uniformly sized; sparsely populated regions, for example, have huge counties.
Later, as my analysis neared its final form, I wanted to build an interactive tool that lets a viewer click into a regional map and pop up the air quality graph for that county. To achieve this, I used a Python package called Folium, which interacts beautifully with GeoJSON and TopoJSON objects and generates an interactive Leaflet map. If you have a solid working proficiency in HTML and CSS, this package is a godsend, though I will say that some of the documentation for D3-based graphing via the Vincent package is out of date.
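Something like this is all it takes (the GeoJSON file name and the map center are placeholders, not my actual project files):

```python
import json
import folium

# Placeholder map center and file name -- swap in your own boundaries file
m = folium.Map(location=[36.7783, -119.4179], zoom_start=6)  # roughly centered on California

with open("ca_counties.geojson") as fh:      # hypothetical county boundaries file
    counties = json.load(fh)

# Drop the county polygons onto the interactive Leaflet map
folium.GeoJson(counties, name="counties").add_to(m)
folium.LayerControl().add_to(m)

m.save("county_map.html")   # or just display `m` in a Jupyter cell
```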
Above is a snippet showing how easy it is to load a GeoJSON file and fire up a quick visual. From there you can bind a dictionary or DataFrame of values to generate a choropleth.
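In the current Folium API, that binding looks roughly like the sketch below; the column names, the key_on path, and the AQI values are stand-ins rather than my real schema:

```python
import pandas as pd
import folium

# Hypothetical table of mean AQI per county -- columns and values are illustrative
aqi_by_county = pd.DataFrame({
    "county": ["Los Angeles", "San Francisco", "Fresno"],
    "mean_aqi": [78.2, 41.5, 95.1],
})

m = folium.Map(location=[36.7783, -119.4179], zoom_start=6)

folium.Choropleth(
    geo_data="ca_counties.geojson",      # same hypothetical boundaries file as above
    data=aqi_by_county,
    columns=["county", "mean_aqi"],      # [key column, value column]
    key_on="feature.properties.NAME",    # assumed path to the county name in the GeoJSON
    fill_color="YlOrRd",
    fill_opacity=0.7,
    line_opacity=0.3,
    legend_name="Mean AQI",
).add_to(m)

m.save("aqi_choropleth.html")
```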
Finally, just as I thought my mission was complete, I realized that GitHub strips out the JavaScript that makes this map interactive, meaning the hosted version in my portfolio doesn't show up!
Enter Jupyter's interactive notebook viewer! I had little trouble hosting the project here, where it can be viewed in its entirety and clicked through accordingly.
I'll get more into the time series and predictive modeling at work in this project in another post. For the moment, I'll leave it at this: working in many different environments, with different packages and disparate datasets, is a good way to test your data management skills and force you into better file management practices. It's also a reminder that working in Python is rewarding because of the robust ecosystem of supporting packages out there. I highly recommend Folium, and I'm also going to investigate geopandas as I build this project out further.
Links:
Interactive Notebook EPA Report Pt. 1 - Time series analysis on Daily Air Quality Index Scores over 3 years.
Interactive Notebook EPA Report Pt. 2 - Random forest and logistic regression classifiers on the AQI targets from Pt. 1.