Sagar Nandu (sagarnanduunc)
@sagarnanduunc
sagarnanduunc / Data Pipeline for Structural Topic Modeling in R for Twitter Data
Created April 25, 2018 14:33
How to perform Structural Topic Modeling (stm) in R. This code works on Twitter data, but can be applied to any corpus that has a unique id field.
library(tidyverse)
loc <- "FILEPATH/data.csv"  # replace FILEPATH with the directory holding your data
tweets <- read_csv(loc)  # load the data
# Data was already preprocessed in Python, but the same steps can be done in R
# Preprocessing steps:
# lowercased, URLs removed, stop words removed, all punctuation removed except # and @
library(quanteda)
@sagarnanduunc
sagarnanduunc / Creating timeline of pandas data frame based on time granularity using Time Grouper
Last active June 19, 2018 04:21
Grouping data in pandas on a date-based column. Grouping can be done by seconds, minutes, hours, days, months, or even years. Returns a time-vs-count data frame.
import pandas as pd

# "field" is the name to give the record-count column in the timeline dataframe
def createTimeLine(df, field, granularity):
    # 'postedTime' comes from Twitter data; rename it for other datasets
    # pd.Grouper replaces the deprecated pd.TimeGrouper for time-based grouping
    # (granularity is a pandas frequency string: 'S', 'T', 'H', 'D', 'M', 'Y', ...)
    timegrp = df.set_index('postedTime').groupby(pd.Grouper(freq=granularity))
    timeCount = {"day": [], field: []}  # dict with keys "day" and field, converted to a dataframe below
    # users = len(df.groupby("actorId"))
    for time_unit in timegrp:  # iterate over the formed groups
        timeCount["day"].append(time_unit[0].strftime('%Y-%m-%d'))  # group label
        timeCount[field].append(len(time_unit[1]))  # record count in the group
    return pd.DataFrame(timeCount)
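As a quick sanity check, here is a minimal sketch of the same time-based grouping with `pd.Grouper`; the sample timestamps and the `tweets` column name are invented for illustration:

```python
import pandas as pd

# Hypothetical sample: three tweets across two days
df = pd.DataFrame({
    "postedTime": pd.to_datetime([
        "2018-06-01 09:00", "2018-06-01 17:30", "2018-06-02 08:15",
    ]),
    "text": ["a", "b", "c"],
})

# Group by calendar day and count rows per group
timeline = (
    df.set_index("postedTime")
      .groupby(pd.Grouper(freq="D"))
      .size()
      .rename("tweets")
      .reset_index()
)
# timeline has two rows: two tweets on June 1, one on June 2
```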
import re
import string
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

tknz = TweetTokenizer()
stop = stopwords.words('english') + list(string.punctuation)
# Translation table that strips all punctuation except #, @ and apostrophes
translator = str.maketrans('', '', string.punctuation.replace("#", "").replace("@", "").replace("'", ""))

def cleanTweet(text):
    text = re.sub(r"\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*", "", text).lower()  # remove URLs
    text = text.translate(translator)  # drop punctuation except #, @, '
    tokens = [t for t in tknz.tokenize(text) if t not in stop]  # tokenize and drop stop words
    return " ".join(tokens)
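To see what the URL regex and the punctuation-stripping translation table do in isolation (without the NLTK steps), here is a self-contained sketch; the sample tweet text is invented:

```python
import re
import string

# Hypothetical tweet with a URL, punctuation, a hashtag, and a mention
tweet = "Check https://example.com/page NOW!!! #NLP @friend"

# Same URL-removal pattern as the snippet above, then lowercase
no_urls = re.sub(r"\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*", "", tweet).lower()

# Strip punctuation except #, @ and apostrophes
keep = string.punctuation.replace("#", "").replace("@", "").replace("'", "")
cleaned = no_urls.translate(str.maketrans('', '', keep))
# cleaned keeps "#nlp" and "@friend" but loses the URL and the "!!!"
```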
import csv
import pandas as pd

# The explicit '\n' line terminator avoids platform line-ending issues (e.g. on macOS)
# The saved CSV can also be read back into R
df.to_csv("PATH WHERE YOU WANT TO SAVE YOUR FILE/filename.csv", quoting=csv.QUOTE_NONNUMERIC, date_format='%Y-%m-%d %H:%M:%S', encoding='utf-8', line_terminator='\n')
df = pd.read_csv("PATH WHERE YOUR FILE IS SAVED/filename.csv", encoding='utf-8', lineterminator='\n', index_col=0)
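A minimal round trip of the same save/load pattern through an in-memory buffer (so no file paths are needed; the data frame is invented, and the version-sensitive line-terminator argument is left at its default):

```python
import csv
import io
import pandas as pd

# Hypothetical frame with a datetime column
df = pd.DataFrame({"postedTime": pd.to_datetime(["2018-06-01 09:00:00"]), "n": [3]})

buf = io.StringIO()
# Quote non-numeric fields and format datetimes exactly as in the snippet above
df.to_csv(buf, quoting=csv.QUOTE_NONNUMERIC, date_format='%Y-%m-%d %H:%M:%S')
buf.seek(0)

back = pd.read_csv(buf, index_col=0)
back["postedTime"] = pd.to_datetime(back["postedTime"])  # restore dtype after the round trip
```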
@sagarnanduunc
sagarnanduunc / topic-helper.R
Created March 13, 2018 20:01
R helper function for topic modeling
analyzeTopics <- function(ctmFit, fileLoc){
  td_beta <- tidy(ctmFit, matrix = "beta")
  # helper functions (from David Robinson's drlib R package)
  scale_x_reordered <- function(..., sep = "___") {
    reg <- paste0(sep, ".+$")
    ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
  }
  reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
    new_x <- paste(x, within, sep = sep)
    stats::reorder(new_x, by, FUN = fun)
  }