Timeline of the Development of Data Science
April 15, 2024
I wanted to understand how data science actually developed and grew into what it is today, so I turned to the DataFramed podcast as one of my first sources. Specifically, I listened to episode 63, The Past and Present of Data Science. This podcast, available on Spotify, covers evolving technology and data techniques by bringing in industry members to share insights into the future of data science.
The world of data science has seen a dramatic transformation in the last decade. In this episode of the DataFramed podcast, Sergey Fogelson, Vice President of Data Science and Modeling at Viacom, draws on his experiences to offer insight into the exciting journey of this developing field.
Fogelson starts by painting a picture of the data science landscape in early 2014, a time when large-scale projects were a rarity. Fast forward to today, and data science has driven growth in a variety of industries, from banking and tech to cybersecurity and digital advertising. This growth is fueled in part by the strong connection between data science and machine learning.
The podcast explores this evolution in detail. We learn how tools like Hadoop and Scalding, once the foundations of data science, are now considered “legacy systems,” or outdated computing systems. Today, data scientists use Apache Spark and compressed, efficient CSV files to tackle massive datasets sitting in terabyte-scale data warehouses. This shift has been monumental: data can now be queried within minutes, replacing the laborious process of running complex algorithms on flat files.
However, Fogelson argues that the most revolutionary development is the rise of data frameworks like Airflow or Oracle databases. These frameworks streamline data pipelining, ensuring a smooth flow of data through post-processing and analysis, which has made the data scientist’s job much easier and more straightforward.
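To make the pipelining idea concrete for myself, here is a minimal sketch in plain Python of what "tasks run in dependency order" means. It is not Airflow's API and nothing in it comes from the episode: the task names, the sample records, and the `run_pipeline` helper are all my own assumptions for illustration. It only uses the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter


def extract():
    # Hypothetical raw records standing in for a real data source.
    return [{"user": "a", "views": "3"}, {"user": "b", "views": "5"}]


def clean(rows):
    # Convert the string counts into integers.
    return [{"user": r["user"], "views": int(r["views"])} for r in rows]


def summarize(rows):
    # Total views across all cleaned rows.
    return sum(r["views"] for r in rows)


# Each task name maps to the set of tasks it depends on (illustrative only).
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "summarize": {"clean"},
}


def run_pipeline():
    results = {}
    tasks = {
        "extract": lambda: extract(),
        "clean": lambda: clean(results["extract"]),
        "summarize": lambda: summarize(results["clean"]),
    }
    # static_order() yields every task after the tasks it depends on.
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = tasks[name]()
    return results


results = run_pipeline()
print(results["summarize"])
```

A real orchestration framework adds scheduling, retries, and monitoring on top of this, but the core idea of declaring dependencies and letting the tool work out the run order is the same.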
From a machine learning perspective, the field has become much more accessible. Efficient open-source frameworks are now available, eliminating the need for data scientists and developers to build everything from scratch. This, coupled with a large and supportive online community, lets individuals create highly complex models with relative ease. Additionally, the development of non-linear algorithms and of feature stores that house dynamic datasets further streamlines the process.
Looking ahead, the future of data science holds exciting possibilities. Fogelson describes a future where machine learning models can be continuously updated with new data, eliminating the need to rebuild a model entirely whenever it needs a new feature or new data inputs. He also emphasizes the importance of establishing best practices for data science, including clear presentation of predictions and fostering a data-driven culture within corporations.
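As a rough illustration of updating a model with new data instead of rebuilding it, here is a small sketch of online learning in plain Python. It is my own toy example, not something described in the episode: the `OnlineLinearModel` class, the learning rate, and the synthetic data following y = 2x + 1 are all assumptions. It also only covers new data arriving over time, not the harder case of adding a brand-new feature.

```python
class OnlineLinearModel:
    """Toy linear model trained one example at a time (hypothetical helper)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, batch):
        # One stochastic gradient step per example on squared error.
        # Existing weights are adjusted in place; nothing is retrained from scratch.
        for x, y in batch:
            error = self.predict(x) - y
            self.bias -= self.learning_rate * error
            self.weights = [
                w - self.learning_rate * error * xi
                for w, xi in zip(self.weights, x)
            ]


def make_batch():
    # Synthetic data (an assumption for this sketch): y = 2x + 1 for x in 0.0 .. 0.9.
    return [([i / 10], 2 * (i / 10) + 1) for i in range(10)]


model = OnlineLinearModel(n_features=1)

# Simulate batches arriving over time; the model is updated after each one.
for _ in range(200):
    model.update(make_batch())

print(round(model.predict([0.5]), 2))
```

The point is that each call to `update` only nudges the existing parameters, so the cost of absorbing a new batch does not depend on how much data the model has already seen.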
The podcast concludes with a valuable insight for aspiring data scientists. Fogelson highlights the crucial role of translating executive directives into measurable metrics. Being able to turn a question like “How can we increase profit by 10%?” into “operational” data analysis is a core skill for success as a data scientist. Finally, Fogelson underscores the importance of verifying data and getting familiar with each new dataset encountered.
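For the data verification point, a first pass over a new dataset might look something like the sketch below. This is only my own guess at a sensible starting check, not a procedure Fogelson describes: the `profile_column` helper and the sample rows are hypothetical.

```python
def profile_column(rows, column):
    """Summarize one column of a list-of-dicts dataset (hypothetical helper)."""
    values = [row.get(column) for row in rows]
    # Treat None and empty strings as missing.
    present = [v for v in values if v not in (None, "")]

    numeric = []
    for v in present:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            pass  # Value is present but cannot be read as a number.

    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "non_numeric": len(present) - len(numeric),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
    }


# Made-up sample data for illustration only.
rows = [
    {"show": "A", "viewers": "1200"},
    {"show": "B", "viewers": ""},
    {"show": "C", "viewers": "n/a"},
    {"show": "A", "viewers": "950"},
]

report = profile_column(rows, "viewers")
print(report)
```

Even a summary this small surfaces questions worth asking before any modeling starts, such as why a value is missing or what a placeholder like "n/a" is supposed to mean.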
This episode of DataFramed offered a really nice introduction to the topic of data science. It was interesting to hear about the earlier tools and limitations of the field and how much has changed since then. I’m excited for the future of data science, and the directions Fogelson mentioned are ones I’ll continue to research as an aspiring data scientist.