Data Lake & Analytics Undergraduate Research

Role: Researcher & Developer

Team: Computer Science & Business Majors

Timeline: September 2019 - May 2020

Technologies: Python, Jupyter Notebook, Pandas, BeautifulSoup, FuzzyWuzzy

About:

As an undergraduate researcher for the Severino Center for Technological Entrepreneurship, I collaborated with computer science and business majors to gather and analyze social media and business related data in an efficient manner. I developed Python scripts and utilized data science libraries to efficiently extract, parse, and analyze relevant data from large datasets, and saved them within our database directory. I attended weekly team meetings to address problems, updates, or solutions, and kept note of tasks to be completed for the following week.

Tasks:

1) Used textpreprocessing techniques and the levenshtein distance algorithm to determine if a given Instagram profile was a business or an influencer. I utilized Nasdaq's company CSV file to find a relationship between the companies in the Nasdaq file and our scraped Instagram data. Each result was either labeled as 0 or 1 and is to be used for future machine learning purposes, to make it easier to identify an Instagram business or influencer.

2) Created an algorithm to efficiently download images stored on Crunchbase's server, and uniquely identify each one with a UUID which was saved in a designated folder on our server. I used Pandas to easily query through a large CSV file to identify image links which I then used to download these images using the request library.

3) Queried through a large Kickstarter dataset to find founder and company information necessary to discover relationships in our Crunchbase dataset using the pandas library. This task involved querying through a MongoDB collection and finding the most important characteristics and features of a business and its founders alike.

4) Connected to MongoDB and extracted twitter specific information from a Crunchbase specific collection. I used string manipulation to extract the twitter user id from a user's profile URL which I then saved with a UUID (universally unique identifier) used to idenitfy a each specific user within the crunchbase collection.