What I Learned about Data Analysis at the College of William & Mary

Jun 9, 2021

College of William & Mary Wren Building (Source: www.dailypress.com/virginiagazette/va-vg-wm-townhall-0722-20200721-7el4ugdx3jenjjnkooan4t2ffu-story.html)

When it was still snowing in Montreal in late April this year, I kept thinking of the sunshine and refreshing breeze I had enjoyed in Williamsburg last spring. Located in Williamsburg, a small historic town, the College of William & Mary could be so tranquil at night that I could hear only the chirping of cicadas while stargazing with friends, but it could also be energetic when big school events and celebrations were going on. I really enjoyed my time at the College of William & Mary as an exchange student, and I got to explore what I was truly interested in and passionate about: interdisciplinary data analysis.

At that point in my life, I was aspiring to become a data-driven researcher at a leading American think tank focused on foreign policy and international development (one of my two majors; the other is Computer Science). So I joined the AidData Research Lab as a research assistant working on Transparent Development Footprints with 2 program managers and 30 other teammates. If you want to know more about the work I did there, please check out my other post.

Two other wonderful experiences really expanded my understanding of how powerful data, and the stories that data can tell, can be:

• Conducting research in Systematic Text Analysis for International Relations (STAIR), a research lab under the International Relations Department at the College of William & Mary
• Taking an interdisciplinary data science course, Hacking Chinese Studies, and working on a personal news corpora project using Natural Language Processing (NLP) techniques

Systematic Text Analysis for International Relations (STAIR) is a collaborative research group that uses text mining and machine learning tools to analyze and address political issues, with a particular focus on international relations.

Source: https://dataaspirant.com/word-embedding-techniques-nlp/

Specifically, to measure the trend of European integration, I:
• Gathered 30,000+ articles published in 2019 by leading newspapers of EU countries to build a multilingual corpus
• Used word embeddings with Gensim in Python (in Jupyter Notebook) to translate the Portuguese corpus into English, then performed topic modeling with NMF on the corpus
• Visualized the results with graphing libraries such as Pyplot and PyLDAvis
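The NMF topic-modeling step above can be sketched roughly as follows. This is a minimal illustration, not the code used in the STAIR project: it swaps in a tiny pure-Python NMF with multiplicative updates so that it runs without any dependencies, and the four-sentence corpus and the helper names (`matmul`, `transpose`, `nmf`) are made up for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between V and W x H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (docs x terms) into W (docs x topics)
    and H (topics x terms) using multiplicative update rules."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    # Strictly positive random initialisation (illustrative choice).
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


# Toy corpus (invented for this sketch, not taken from the newspaper data).
docs = [
    "parliament votes on trade treaty",
    "trade treaty talks resume in parliament",
    "football club wins league match",
    "league match ends as club wins",
]
vocab = sorted({word for doc in docs for word in doc.split()})
# Document-term matrix of raw word counts.
counts = [[doc.split().count(term) for term in vocab] for doc in docs]

w, h = nmf(counts, n_topics=2)
for k, row in enumerate(h):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print(f"topic {k}:", [term for _, term in top])
```

Each row of `h` weights the vocabulary for one topic, and each row of `w` says how strongly a document draws on each topic. On a real corpus one would more likely use TF-IDF weights and a library implementation such as scikit-learn's `NMF` rather than this hand-rolled version.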

This research opportunity showed me the power of NLP techniques in interdisciplinary research and opened the way for me to learn more about them and to do hands-on individual projects applying similar techniques to other text data. I also learned how to take ownership of my work, as I was responsible for a whole

Everything we learned started with a classic question:

Author identification and attribution

The Federalist Papers (Source: https://www.amazon.ca/Federalist-Papers-Alexander-Hamilton-ebook/dp/B0855MYLTJ)

The Federalists: James Madison, Alexander Hamilton, John Jay (Source: https://www.pbslearningmedia.org/resource/ham16.soc.ushis.federalist/hamiltons-america-the-federalist-papers/)
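The authorship question above is often approached by comparing how frequently each candidate author uses common function words. Below is a minimal sketch under stated assumptions: the word list, the two sample texts, and the names `profile`, `cosine`, and `attribute` are all invented for illustration. The texts are not quotations from the Federalist Papers, and this is not necessarily the method taught in the course.

```python
import math
from collections import Counter

# Hypothetical list of function words used as stylistic markers.
FUNCTION_WORDS = ["the", "of", "to", "and", "in", "upon", "by", "on", "while", "whilst"]


def profile(text):
    """Relative frequency of each function word in a text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(disputed, known):
    """Return the known author whose profile is closest to the disputed text."""
    target = profile(disputed)
    return max(known, key=lambda author: cosine(target, profile(known[author])))


# Invented writing samples for two hypothetical authors.
known = {
    "Author A": "upon the whole the power of the union rests upon the people while the states consent",
    "Author B": "on the whole the power of the union rests on the people whilst the states consent",
}
disputed = "the judgment of the people depends upon the facts while the debate goes on"

print(attribute(disputed, known))
```

With real texts, the samples would be far longer and the marker list much larger, since a handful of words in a few sentences is too little evidence to attribute anything reliably.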

The course introduced me to a variety of newly developed digital tools, algorithms, and datasets that allow us to pursue new insights into traditional Chinese literature and culture. We engaged with new scholarship being published in the rapidly expanding field of the digital humanities and learned how to create digital research projects from scratch. We were introduced to text mining, network analysis, mapping, and digital exhibition creation, among other things. We drew examples from the imperial Chinese tradition and covered the challenges and rewards of working with Chinese language materials using computer systems originally designed for Western languages (the lessons we learned are applicable to people working in other non-Latinate languages).

Principal component analysis

The tools I used include, but are not limited to, Python and its libraries: pandas, NumPy, scikit-learn, Gensim, Matplotlib (including Pyplot), Beautiful Soup, Seaborn, pyLDAvis, argparse, json, and Requests.

Summary

College of William & Mary Aerial View (Source: www.pinterest.ca/pin/183310647303744694/)

Written by Julianna Zhou