Stack Overflow Dataset

Haneen Fathy
5 min read · Oct 11, 2020

For this week’s assignment, I chose to analyse the Stack Overflow dataset. The dataset comprises all the content on Stack Overflow, including posts, comments, votes, tags, and badges. It is updated quarterly, and it is ultimately too large to be downloaded, so it can only be accessed through an API.
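Since the full dataset can’t be downloaded, here is a minimal sketch of how one might query it, assuming access goes through Google BigQuery’s public copy of the dataset (bigquery-public-data.stackoverflow) and the google-cloud-bigquery Python client:

```python
# Minimal sketch: querying the Stack Overflow dataset, assuming it is accessed
# through BigQuery's public copy (bigquery-public-data.stackoverflow).
# Requires the google-cloud-bigquery package and a GCP project with credentials set up.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default Google Cloud credentials

query = """
    SELECT title, creation_date, score
    FROM `bigquery-public-data.stackoverflow.posts_questions`
    ORDER BY creation_date DESC
    LIMIT 10
"""

# Print the ten most recent questions with their scores
for row in client.query(query).result():
    print(row.creation_date, row.score, row.title)
```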

Who collected and compiled it?

The data was collected as part of the Internet Archive, a library that keeps a digital record of everything on the web. The dataset is owned and maintained by Stack Overflow itself.

Why was it collected?

The data was collected as part of a larger initiative to archive and save the internet’s history. It was then compiled by Stack Overflow, which I can assume was done for a number of reasons. First, it is necessary for them to observe trends on the website in order to maintain and update it. Second, keeping a history of all posts, archived or not, is useful for any legal issues that may arise.

Describe the data: What are the dimensions? What are the variables and their data types? What can the first 5–20 rows tell us?

Dimensions: 16 files

Every row has its own id of type int.

File 1 → Badges → 6 columns x 32.5m rows

Variables: badge name, date, user_id, class, tag_based (bool). Nothing missing or out of the ordinary.

File 2→ Comments → 7 columns x 74.5m rows

Variables: text, date, user_id, post_id, user_display_name, score. The user_id column is completely empty while user_display_name is populated. The two columns seem redundant, so why not keep user_id for consistency?
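To sanity-check that redundancy, a quick aggregate query could count how often each of the two columns is actually populated. This is only a sketch; the table and column names are assumed from the description above:

```python
# Sketch: how often are user_id and user_display_name populated in the comments table?
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT
        COUNT(*) AS total_rows,
        COUNTIF(user_id IS NOT NULL) AS with_user_id,
        COUNTIF(user_display_name IS NOT NULL) AS with_display_name
    FROM `bigquery-public-data.stackoverflow.comments`
"""

row = next(iter(client.query(query).result()))
print(row.total_rows, row.with_user_id, row.with_display_name)
```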

File 3→ Post History→ 8 columns x 118m rows

Variables: id, creation_date, post_id, post_history_type_id, revision_guid, user_id, text, comment. Both the user_id and the comment columns are completely empty. The user_id seems important here, yet it is missing. The revision_guid column contains some sort of identifier that looks like an IP address, and I’m not entirely sure what it stands for.

File 4→ Post Links→ 5 columns x 6.10m rows

Variables: id, creation_date, link_type_id, post_id, related_post_id. I assume this file’s purpose is to link a post to other similar posts, but what if there is more than one?

File 5 → Post Answers→ 20 columns x 27.1m rows

Variables: to keep things simple, instead of writing down all 20 columns, I’ll just say this file is very comprehensive. For some reason, though, the number of rows is smaller than Post History. This could mean either that the posts in this file also exist in Post History, which would make it redundant, or that most posts on Stack Overflow are not answered.
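One way to test the redundancy hypothesis would be to join the two files and count how many answers also have Post History entries. This is a rough sketch, assuming the BigQuery table names posts_answers and post_history, and it would be an expensive query to run in full:

```python
# Sketch: do the answer posts also appear in post_history?
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT
        COUNT(DISTINCT a.id) AS answers,
        COUNT(DISTINCT h.post_id) AS answers_with_history
    FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
    LEFT JOIN `bigquery-public-data.stackoverflow.post_history` AS h
        ON h.post_id = a.id
"""

row = next(iter(client.query(query).result()))
print(row.answers, row.answers_with_history)
```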

File 6 → Posts Moderator Nomination → 20 columns x 324 rows

Most columns are empty, but they seem unnecessary in the first place. This file includes the text of every nomination. It still seems odd that there are only 324 nominations to be a moderator.

File 7 → Posts Orphaned Tag Wiki → 20 columns x 2 rows

Not entirely sure what this file is for.

File 8→ Posts Privilege Wiki→ 7 columns x 32.5m rows

I assumed these were posts by Stack Overflow itself, almost like announcements of updates? I’m not entirely sure.

File 9→ Posts Questions→ 20 columns x 17.7m rows

Similar to Post Answers, the number of rows is smaller than Post History. This could mean either that the posts in this file also exist in Post History, which would make it redundant, or that some posts are considered questions and some aren’t.
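A sketch of how the “unanswered” side of this could be checked, assuming each row in Post Answers has a parent_id pointing back to its question, is to count how many questions have at least one answer:

```python
# Sketch: what fraction of questions received at least one answer?
# Assumes posts_answers.parent_id references the question's id.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT
        COUNT(*) AS questions,
        COUNTIF(a.parent_id IS NOT NULL) AS answered_questions
    FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
    LEFT JOIN (
        SELECT DISTINCT parent_id
        FROM `bigquery-public-data.stackoverflow.posts_answers`
    ) AS a
    ON a.parent_id = q.id
"""

row = next(iter(client.query(query).result()))
print(f"{row.answered_questions} of {row.questions} questions have at least one answer")
```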

File 10 → Posts Tag Wiki→ 20 columns x 49.3k rows

Looking at the body text in this file, it seems to be descriptions of the different technologies that are tagged on posts.

File 11 → Posts Tag Wiki Excerpt→ 20 columns x 49.3k rows

Redundant, like file 10.

File 12 → Posts Wiki Placeholder→ 20 columns x 4 rows

Simple placeholder texts. Most columns are empty and unnecessary though.

File 13 → Stack Overflow Posts→ 20 columns x 30.0m rows

How is this different from the other files that contain posts? Unclear.

File 14 → Tags → 5 columns x 55.7k rows

Different tags, complete and nothing out of the ordinary.

File 15 → Users → 13 columns x 10.9m rows

Almost complete except for age, which got me thinking whether there is a minimum age to make an account on Stack Overflow or whether it is accessible to everyone.

File 16 → Votes → 4 columns x 178m rows

Variables: id, creation_date, post_id, vote_type_id.

Complete and succinct.

General Analysis

Surprisingly enough, I did not completely hate looking at data. Big datasets can be very daunting, but going through this one bit by bit demystified it for me. The data is very comprehensive, and it is representative of the kind of users on the website in terms of their professions, years of experience, etc., which I assume can be very useful if somebody were to use this data for a real-life project.

This dataset in particular can be very useful for a lot of purposes. Using machine learning, you could build a model that analyses trends in posts to predict upcoming technologies. You could also predict how long it would take to get a response to your question, or identify what kinds of questions receive the fewest responses and how that could be fixed.
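As a sketch of the trend idea, a query could count how many questions use each tag per year, which could then feed a model that spots rising technologies. This assumes the tags column is a “|”-separated string, as in the BigQuery copy of the dataset:

```python
# Sketch: questions per tag per year, as raw material for a technology-trend model.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT
        EXTRACT(YEAR FROM creation_date) AS year,
        tag,
        COUNT(*) AS n_questions
    FROM `bigquery-public-data.stackoverflow.posts_questions`,
        UNNEST(SPLIT(tags, '|')) AS tag
    GROUP BY year, tag
    ORDER BY year, n_questions DESC
"""

# Needs pandas (and db-dtypes) installed for to_dataframe()
trend = client.query(query).to_dataframe()
print(trend.head(20))
```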

Coding Exercise

For the coding exercise, I tried to play around with Lydia’s code. I decreased the inputs to just the borough and the zip code of the incident to see if there is a wider correlation between response times and specific neighborhoods (comparing low-income to high-income neighborhoods). I later realized that wouldn’t work because of the differences in the incidents reported.

Prediction is very off

So instead I just tried adding more input fields to make it more accurate and see if there is a difference between neighborhoods, without changing any of the other inputs.

However, every time I tried to change the values and submit after the initial run, I would get this error.

I assumed the error had to do with the ml5 CDN, so I updated it to the latest version, but it still wouldn’t run. I checked Lydia’s code and it had the same error.

After updating the ml5 link, I got this error.

I spent the majority of my time trying to get the submit button to work so I wouldn’t have to retrain the model every time I run my initial values, but I was unable to figure it out. When I reran my model with different initial values that I changed in the HTML file, it still wouldn’t make predictions and I got the same error.

After a couple of hours of frustration, I decided to just leave it and go over Lydia’s code in more depth so I could understand it better. I managed to spend some time reading up on the ml5 neural network, but I was unfortunately unable to get the code running.
