This year the Center for Urban Informatics and Progress (CUIP) is putting on its second annual data science conference (link to 2018 conference). The 2019 CUIP conference is focused on the analysis of smart city data, and it also includes a data modeling competition that ties into the conference focus with the theme:
"Predicting Urban Air Pollution Levels From Weather Data and Street Video Events"
Video: Example object detection (YOLO v2) video footage from Broad Street in Chattanooga
The CUIP program has been steadily building out its smart city platform for data collection at the pole and wide scale data ingest into a data lake.
Dr. Sartipi and UTC decided to release a portion of the CUIP data collected from the testbed to further engage the research community with the results of the data collection. To support the focus of the conference, the data release kicks off with the CUIP 2019 Smart City Data Challenge, open to anyone and everyone.
Patterson Consulting (as a CUIP supporter) has been involved in advising on the infrastructure architecture and data science practices for the CUIP Smart City Platform. This year Patterson Consulting also helped design the CUIP Smart City Data Challenge. This support falls in line with Patterson Consulting's long-term support of regional research output and its correlation with regional economic growth and development.
What is the Data Challenge?
The 2019 Smart City Data Challenge is focused on predicting PM2.5 levels in the air from the recently open sourced CUIP 2019 Challenge Dataset. The CUIP program is interested in multiple domains related to smart city analytics, but this year it is focused specifically on how smart cities can be improved with sensor deployments across the city. One of the first applications of this sensor deployment is understanding how traffic can affect quality of life in a city, specifically traffic's impact on people's health.
The concept of the CUIP Dataset is largely drawn from the purpose and success of the ImageNet dataset. Machine learning research in any domain cannot move forward without good open datasets, no matter how exotic the applied machine learning methods are. To drive smart city application research forward it was concluded that the CUIP program needed to be a leader in building quality smart city datasets. Just like with the ImageNet dataset, the CUIP research group built a dataset and then created a competition and conference to get like-minded researchers in the same room to talk about the problem space.
For this data challenge we're looking to predict PM2.5 levels each day based on cars passing through the smart city corridor and local weather from the sensor stations in the corridor. If we can build a model to predict pollution levels at the intersection level in an urban environment, we can potentially build a more detailed pollution map of urban areas and better understand how our cities work. This contest is unique in that we're combining weather data with urban street-level object detection data to predict PM2.5 levels, where most competitions so far have used only weather data.
“In 2013, a study involving 312,944 people in nine European countries revealed that there was no safe level of particulates and that for every increase of 10 μg/m3 in PM10, the lung cancer rate rose 22%. The smaller PM2.5 were particularly deadly, with a 36% increase in lung cancer per 10 μg/m3 as it can penetrate deeper into the lungs. Worldwide exposure to PM2.5 contributed to 4.1 million deaths from heart disease and stroke, lung cancer, chronic lung disease, and respiratory infections in 2016.”
“Traffic congestion increases vehicle emissions and degrades ambient air quality, and recent studies have shown excess morbidity and mortality for drivers, commuters and individuals living near major roadways. Presently, our understanding of the air pollution impacts from congestion on roads is very limited. ”
It stands to reason that better understanding air pollution beyond the granularity of the city-level has value to the citizens and government of any city.
Where does PM2.5 come from?
There are some natural sources of PM2.5, such as forest fires, agricultural burning, volcanic eruptions, and dust storms. Our focus here is how a city can be impacted by human sources of PM2.5. These human-based sources include vehicle exhaust, industrial emissions, power generation, and residential wood and coal burning.
So now that we know a little more about PM2.5, let's look at how the CUIP data challenge is set up.
The CUIP Smart City Data Challenge Dataset
The initial idea for the dataset release was to give researchers an easy way to work with smart city data so they could experiment and come up with novel applications. Bringing software into the CUIP platform, while possible, is not as simple as "just working with data". From that perspective, we wanted to give the data community a dataset they could work with on their own time and potentially reference in their own research.
The base hypothesis, supported by early analysis, was that PM2.5 levels on the street correlate over time with vehicles passing through the street. However, there were other factors in play, such as:
type of vehicle
construction around the sensor pole
time of day
It's worth noting that earlier competitions around predicting PM2.5 levels were focused mostly on using weather data as the independent input variables. In our case, contestants can use weather data as input, but they can also use finer-grained data per intersection (e.g., "vehicle detections") based on the data collected by the CUIP data platform. This lets us model PM2.5 levels at per-intersection resolution and better understand how pollution affects specific neighborhoods in a city.
Just as understanding how cars move through an intersection can make traffic operations more efficient, we can better design and operate cities when we understand how pollution affects sections of our city. As car usage has increased in over-subscribed urban areas, vehicle emissions have risen significantly as a relative contribution to city pollution. As cities become more complex with the rise of global population, it is critical to measure and understand all aspects of how we operate cities.
This competition is based around further exploring how the main independent variable (vehicle object counts from video frame analysis), along with weather data, can more accurately predict PM2.5 levels during the day.
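As a minimal sketch of this modeling setup, the snippet below fits an ordinary least squares baseline that predicts PM2.5 from hourly vehicle counts plus weather readings. The column names and values are illustrative, not the official dataset schema.

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame: hourly vehicle counts plus weather readings.
# Column names and values are illustrative, not the official schema.
df = pd.DataFrame({
    "vehicle_count": [120, 340, 510, 95, 430, 280],
    "temperature":   [21.0, 24.5, 27.1, 19.8, 26.0, 23.3],
    "humidity":      [55, 48, 42, 60, 45, 50],
    "pm2_5_cf_1":    [8.1, 14.2, 19.5, 7.0, 17.8, 12.4],
})

X = df[["vehicle_count", "temperature", "humidity"]].to_numpy(dtype=float)
X = np.column_stack([np.ones(len(X)), X])   # add an intercept term
y = df["pm2_5_cf_1"].to_numpy(dtype=float)

# Ordinary least squares as a simple baseline regression.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef
```

A linear baseline like this is only a starting point; contestants will likely want to move on to tree ensembles or neural networks once the feature pipeline is in place.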
This is a compelling use of video-frame-detected objects because it potentially makes it far simpler to build fine-grained pollution maps of cities.
Going Beyond Just Cleaning Up the City
Finally, we'll note how the city has come full circle from being infamously named "America's Dirtiest City" by Walter Cronkite on the evening news. To be fair, he wasn't wrong, as the lead paragraph from the March 4, 1969 Chattanooga Times read:
"Chattanooga's particulate air pollution is ranked the worst in the nation for the period of 1961-65 in an 1,800-page publication on Air Quality Criteria for Particulate Matter just released by the Department of Health, Education and Welfare."
The air quality data is collected from the Purple Air sensor located at each intersection along the smart corridor. Each reading includes columns such as humidity and temperature, along with the PM2.5 level at the same point in time and other particulate types. Take a look at the dataset website to see all of the columns.
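Loading one of these air quality files with pandas is straightforward. The sketch below uses a tiny inline sample in place of a real file; the column names here are assumptions for illustration, so check the dataset website for the actual schema.

```python
import io
import pandas as pd

# A tiny inline sample standing in for one Purple Air CSV file.
# Column names are illustrative; see the dataset website for the real schema.
sample = io.StringIO(
    "created_at,humidity,temperature,pm2_5_cf_1\n"
    "2019-06-01 07:00:00,55,21.0,8.1\n"
    "2019-06-01 08:00:00,48,24.5,14.2\n"
)

# Parse the timestamp column up front so time-based joins and grouping work.
air = pd.read_csv(sample, parse_dates=["created_at"])
```

For the real dataset, swap the `StringIO` sample for the downloaded file path.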
Releasing the dataset involved a few challenges, namely the limitation that CUIP was not allowed to release image data from any of the cameras on the smart corridor. With that restriction in mind, the CUIP program used open source computer vision models to produce object detections on the data and then save only the detected objects in their data lake internally.
The object detection data is in the other main directory in the dataset and contains daily object detections as they occurred per intersection. This data was more complex to release due to limitations around what information could be made public: the CUIP program could not release the video data itself, so instead they decided to release the detected objects from the video frames. These object detections were produced with the YOLOv2 convolutional neural network computer vision model along with some custom scene object tracking code developed by the CUIP program.
Graphed sample of object detection data per hour at each intersection.
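A per-hour count like the one graphed above can be built from the raw detection rows with a pandas groupby. The detection log schema below (timestamps, intersection names, labels) is assumed for illustration; adjust the column names to match the released files.

```python
import io
import pandas as pd

# Illustrative detection log: one row per detected object (schema assumed).
sample = io.StringIO(
    "timestamp,intersection,label\n"
    "2019-06-01 07:05:12,mlk_central,car\n"
    "2019-06-01 07:18:40,mlk_central,truck\n"
    "2019-06-01 07:44:03,mlk_central,car\n"
    "2019-06-01 07:51:27,mlk_douglas,car\n"
)
det = pd.read_csv(sample, parse_dates=["timestamp"])

# Count detections per intersection per hour of day -- a likely form
# for the main independent variable in the regression.
hourly = (
    det.groupby(["intersection", det["timestamp"].dt.hour])
       .size()
       .rename("detections")
       .reset_index()
       .rename(columns={"timestamp": "hour"})
)
```

From here, the hourly counts can be joined to the air quality readings on intersection and hour to form the model's feature table.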
We'll note that the YOLOv2 model has its trade-offs, but coupled with the object tracking methods developed by the CUIP program it was able to achieve good accuracy. Because the YOLOv2 detections are consistently effective across all frames at all intersections, any missed or duplicated detections should normalize out in the input data normalization/standardization/vectorization process. Beyond these disclaimers, we'll also note that this is real-life data, and a seasoned practitioner will recognize that we often must deal with data that needs to be cleaned up.
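To illustrate why a consistent detector bias washes out: z-score standardization removes any fixed shift or scale in the counts, so a detector that consistently over- or under-counts by some factor yields the same standardized feature. The values below are made up for the sketch.

```python
import numpy as np

# Hourly vehicle counts for one intersection (illustrative values).
counts = np.array([120.0, 340.0, 510.0, 95.0, 430.0, 280.0])

# Z-score standardization: a consistent over/under-count shifts the
# mean and scale of the series, and both are removed here.
standardized = (counts - counts.mean()) / counts.std()
```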
We consider having to deal with complex data to be part of the competition and we look forward to hearing the methods used by the top 3 teams during the conference on the winners panel.
This contest is based around “Predicting Urban Pollution Levels From Street Video Events and Weather”.
Specifically for this competition we're going to use the column "pm2_5_cf_1" in the air quality data as the dependent variable for the model (in the purple air instructions they indicate this by saying "PM2.5 (CF=1) ug/m3 This is the field to use for PM2.5"). In your regression model you should be using the value of this column as your output/label. To better understand the meanings of the columns in the air quality data, check out this gdoc from Purple Air with schema information.
Why are there "a" and "b" channels for each pm-variable?
“PurpleAir sensors employ a dual laser counter to provide some level of data integrity. This is intended to provide a way of determining sensor health and fault detection. Some examples of what can go wrong with a laser counter are a fan failure, insects or other debris inside the device or just a layer of dust from long term exposure.
If both laser counters (channels) are in agreement, the data can be seen as excellent quality. If there are different readings from the two channels, there may be a fault with one or both.”
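One practical way to use the dual channels is to flag readings where the two laser counters diverge and drop or down-weight them. The channel column names and the tolerance below are assumptions for illustration, not values prescribed by the competition.

```python
import pandas as pd

# Illustrative readings with the dual-channel pm2.5 fields (names assumed).
air = pd.DataFrame({
    "pm2_5_cf_1":   [8.1, 14.2, 55.0],   # channel "a"
    "pm2_5_cf_1_b": [8.4, 13.9, 12.0],   # channel "b"
})

# Flag rows where the two laser counters disagree by a large margin,
# which may indicate a sensor fault (the threshold here is arbitrary).
tolerance = 5.0
air["channels_agree"] = (air["pm2_5_cf_1"] - air["pm2_5_cf_1_b"]).abs() <= tolerance
clean = air[air["channels_agree"]]
```

A relative (percentage) tolerance may be more appropriate than an absolute one at high particulate levels; this is a judgment call left to each team.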
The held out data for the prediction task will contain data for the dates June 25th - 29th:
weather (no particulate matter information)
vehicle object detections
All data is in the CSV text file format.
The goal is to predict the missing pm2_5_cf_1 column for the days of June 25, 26, 27, 28, and 29. This means the contestant will need to make predictions for pm2_5_cf_1 across 5 days, during 12 hours each day (on the top of the hour), for 7 different intersections.
The prediction output CSV file will contain a line for every combination of the 7 intersections and each top-of-the-hour slot from 7am-7pm (12 hours) across the 5 days. This gives a CSV submission file of (7 x 12 x 5 =) 420 lines of pm2_5_cf_1 predictions.
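A quick way to sanity-check your submission shape is to generate the 420-row skeleton up front and fill in predictions afterward. The intersection IDs, column layout, and choice of the 12 hourly slots below are assumptions for illustration; the real dataset defines the actual seven intersections and the expected file format.

```python
import csv
import io
import itertools

# Hypothetical intersection IDs and one possible reading of the
# "7am-7pm, 12 hours" window (start-of-hour slots 7 through 18).
intersections = [f"intersection_{i}" for i in range(1, 8)]
days = ["2019-06-25", "2019-06-26", "2019-06-27", "2019-06-28", "2019-06-29"]
hours = list(range(7, 19))

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["date", "hour", "intersection", "pm2_5_cf_1"])
n_rows = 0
for d, h, s in itertools.product(days, hours, intersections):
    writer.writerow([d, h, s, ""])   # prediction value filled in by your model
    n_rows += 1

# 5 days x 12 hours x 7 intersections = 420 prediction slots
```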
Each team will save their entry in the above CSV file format and email it to firstname.lastname@example.org for leaderboard processing. If your team has questions, you can also use the same email address for any questions about the competition.
These sensors generate data readings locally and these data readings are collected and ingested with Apache Kafka into the CUIP internal data lake. From there the team is able to process and analyze the data with tools such as:
python and pandas
GPUs and TensorFlow
Kubernetes and Kubeflow
With a deployment of sensors at the edge, an enterprise-class data collection/ingestion system (Apache Kafka), and modern data analysis tools, the CUIP team has been able to drive new use cases of smart city data. For more details on the hardware installed on the CUIP smart city platform, check out their features page.
R and Python Code to Get You Started
Some of the seasoned data scientists out there may jump right in and start downloading data to analyze with their favorite tools and methods. However, for our newer data scientist friends, we wanted to provide some sample data and an example notebook (hosted on Github) showing how to work with the data in python with pandas. We hope this gives people who just want to learn more about the CUIP dataset challenge an easier way to work with the data in a friendly Jupyter notebook environment. In the image below we can see some operations being performed on the sample object detection data from the dataset.
Even if you don't participate in the competition, it might be fun to play around with the data and see what you can learn about how traffic moves through MLK street in Chattanooga. We'd love to hear your stories at the conference during the coffee or poster sessions.