A Guide to Collecting and Finding Data Sources

July 30, 2024

In any data science project, the quality of your data can make or break the outcome. Data is the backbone of every machine learning and data science project. Whether you're building predictive models, performing exploratory analysis, or visualizing trends, access to the right data is vital to building a functional, performant model. This week's blog will walk you through the important parts of collecting and finding data, including widely used resources to help you gather the datasets you need.

Before collecting data, it's important to clearly define your project's objectives and requirements. What is the purpose of the project? What specific questions are you trying to answer? What kind of data will help you answer them? Understanding your data needs up front will help you zero in on the data points your project actually requires.

Where Can I Find Data?

There are numerous repositories and websites that give users access to a wide variety of public datasets. One website I frequently use is Kaggle, which also hosts fun data science competitions. Most recently, I took a Kaggle dataset containing thousands of images of cancerous moles and built a machine learning model that classifies melanoma moles versus normal ones. Kaggle hosts hundreds of thousands of datasets, most of them just a simple search away. Google Dataset Search is a specialized search engine, similar to Google Scholar but for data, that helps you find datasets across the web, while Data.gov, the U.S. government's open data resource, offers datasets on a wide range of topics. Government data is especially helpful for population-level questions, such as median incomes and poverty rates.
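As a quick illustration, here's roughly what pulling a Kaggle dataset straight into Python looks like with the kagglehub library. The dataset slug below is a placeholder, and you'll need Kaggle credentials configured for it to run:

```python
import kagglehub

# Download the latest version of a dataset from Kaggle.
# "owner/melanoma-skin-images" is a placeholder slug -- swap in the real
# owner/dataset-name pair from the dataset's Kaggle URL.
path = kagglehub.dataset_download("owner/melanoma-skin-images")

print("Dataset files downloaded to:", path)
```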

APIs (Application Programming Interfaces) allow you to access data from various online sources through your own code. Many organizations provide APIs so programmers can use their data, such as the Twitter API for tweets and user information, the OpenWeatherMap API for real-time weather data, and the Spotify API for music and artist data.
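To make that concrete, here's a minimal sketch of calling the OpenWeatherMap current-weather endpoint with Python's requests library. The API key is a placeholder you'd get by registering for a free account:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder -- register at openweathermap.org for a key
url = "https://api.openweathermap.org/data/2.5/weather"
params = {"q": "Boston,US", "units": "metric", "appid": API_KEY}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()  # surface HTTP errors instead of failing silently

data = response.json()
print(data["main"]["temp"], data["weather"][0]["description"])
```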

Another way to gather data, one of the most popular despite being ethically murkier, is web scraping. Many popular websites like Twitter and CNN protect their data with anti-scraping measures, but for sites that allow it, programming tools and libraries like Beautiful Soup and Scrapy can help you extract data from web pages. For instance, in one of my past projects I used Beautiful Soup and Selenium to automate web scraping on GOAT.com to pull sneaker data (see the sketch below). It's important to be mindful of legal and ethical considerations, such as a website's terms of service and data privacy regulations.

To sidestep these concerns entirely, many data scientists conduct their own surveys and questionnaires. This can also be an easy and simple way to collect primary data tailored to your specific needs. I've personally used online survey tools like Google Forms and SurveyMonkey, since they make it easy to customize questions and collect exactly the data you need.
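To give a feel for the scraping workflow mentioned above, here's a minimal Beautiful Soup sketch against quotes.toscrape.com, a sandbox site built specifically for scraping practice, so the terms-of-service concerns don't apply:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page and parse its HTML.
page = requests.get("https://quotes.toscrape.com/", timeout=10)
page.raise_for_status()
soup = BeautifulSoup(page.text, "html.parser")

# On this site, each quote lives in a <div class="quote"> containing
# a <span class="text"> and a <small class="author">.
for quote in soup.select("div.quote"):
    text = quote.select_one("span.text").get_text(strip=True)
    author = quote.select_one("small.author").get_text(strip=True)
    print(f"{text} - {author}")
```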

I Have the Data, Now What?

Raw data is often messy and may contain errors, missing values, or inconsistencies. Data cleaning is a crucial step to ensure the quality and reliability of your dataset. Common data cleaning tasks include removing duplicates, handling missing values, correcting errors, and standardizing formats. Validating your data ensures that it accurately represents the real-world phenomena you are studying; this involves cross-checking against other sources, using statistical methods to check for anomalies and inconsistencies, and consulting domain experts to confirm the accuracy and relevance of the data. For anyone interested, my June 23 blog post covers data preprocessing in more depth.
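As a rough sketch of what those cleaning steps look like in pandas, here's a pass over a hypothetical sneaker-listings file (the file name and column names are assumptions for illustration):

```python
import pandas as pd

# Load a hypothetical raw dataset of sneaker listings.
df = pd.read_csv("sneakers_raw.csv")

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Correct errors / standardize types: coerce non-numeric prices to NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Handle missing values: fill missing prices with the median.
df["price"] = df["price"].fillna(df["price"].median())

# Standardize formats: trim whitespace and normalize brand casing.
df["brand"] = df["brand"].str.strip().str.title()

# Simple validation pass: flag impossible values for manual review.
suspicious = df[df["price"] <= 0]
print(f"{len(suspicious)} listings with non-positive prices need review")
```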

Lastly, collaborating with other researchers and professionals can help you discover new data sources and gain access to datasets that may not be publicly available. There are many private datasets floating around on the web that may be useful to your application. Exploring government and institutional open data portals, using specialized search engines like Google Dataset Search, and engaging with organizations that may have the data you need can also be effective strategies.