Sagar Nandu (sagarnanduunc)
@sagarnanduunc
sagarnanduunc / Data Pipeline for Structural Topic Modeling in R for Twitter Data
Created April 25, 2018 14:33
How to perform Structural Topic Modeling (stm) in R. This code works on Twitter data, but can be applied to any corpus that has a unique id field.
library(tidyverse)
loc <- "FILEPATH/data.csv"  # replace FILEPATH with the directory holding your data
tweets <- read_csv(loc)  # load the data
# Data was already preprocessed in Python, but the same steps can be done in R
# Preprocessing steps:
# lowercased, URLs removed, stop words removed, all punctuation removed except # and @
library(quanteda)
@sagarnanduunc
sagarnanduunc / Creating timeline of pandas data frame based on time granularity using Time Grouper
Last active June 19, 2018 04:21
Grouping data in pandas on a date-based column. Grouping can be done by seconds, minutes, hours, days, months, or even years. Returns a time-vs-count data frame.
import pandas as pd

# "field" is the name to give the record-count column in the timeline dataframe
def createTimeLine(df, field, granularity):
    # 'postedTime' comes from Twitter data; rename it for other datasets
    # pd.Grouper replaces the deprecated pd.TimeGrouper for time-based grouping
    # (granularity is a pandas frequency string: 'S', 'T', 'H', 'D', 'M', 'Y', ...)
    timegrp = df.set_index('postedTime').groupby(pd.Grouper(freq=granularity))
    timeCount = {"day": [], field: []}  # dict with keys "day" and field, converted to a dataframe below
    # users = len(df.groupby("actorId"))
    for time_unit in timegrp:  # iterate over the formed groups
        timeCount["day"].append(time_unit[0].strftime('%Y-%m-%d'))  # group label
        timeCount[field].append(len(time_unit[1]))  # record count in the group
    return pd.DataFrame(timeCount)
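As a quick sanity check, here is a minimal sketch of the same time-based grouping with `pd.Grouper`; the sample timestamps and the `tweets` column name are invented for illustration:

```python
import pandas as pd

# Hypothetical sample: three tweets across two days
df = pd.DataFrame({
    "postedTime": pd.to_datetime([
        "2018-06-01 09:00", "2018-06-01 17:30", "2018-06-02 08:15",
    ]),
    "text": ["a", "b", "c"],
})

# Group by calendar day and count rows per group
timeline = (
    df.set_index("postedTime")
      .groupby(pd.Grouper(freq="D"))
      .size()
      .rename("tweets")
      .reset_index()
)
# timeline has two rows: two tweets on June 1, one on June 2
```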
import re
import string
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

tknz = TweetTokenizer()
stop = stopwords.words('english') + list(string.punctuation)
# Translation table that strips all punctuation except #, @ and apostrophes
translator = str.maketrans('', '', string.punctuation.replace("#", "").replace("@", "").replace("'", ""))

def cleanTweet(text):
    text = re.sub(r"\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*", "", text).lower()  # remove URLs
    text = text.translate(translator)  # drop punctuation except #, @, '
    tokens = [t for t in tknz.tokenize(text) if t not in stop]  # tokenize and drop stop words
    return " ".join(tokens)
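To see what the URL regex and the punctuation-stripping translation table do in isolation (without the NLTK steps), here is a self-contained sketch; the sample tweet text is invented:

```python
import re
import string

# Hypothetical tweet with a URL, punctuation, a hashtag, and a mention
tweet = "Check https://example.com/page NOW!!! #NLP @friend"

# Same URL-removal pattern as the snippet above, then lowercase
no_urls = re.sub(r"\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*", "", tweet).lower()

# Strip punctuation except #, @ and apostrophes
keep = string.punctuation.replace("#", "").replace("@", "").replace("'", "")
cleaned = no_urls.translate(str.maketrans('', '', keep))
# cleaned keeps "#nlp" and "@friend" but loses the URL and the "!!!"
```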
import csv
import pandas as pd

# The explicit '\n' line terminator avoids platform line-ending issues (e.g. on macOS)
# The saved CSV can also be read back into R
df.to_csv("PATH WHERE YOU WANT TO SAVE YOUR FILE/filename.csv", quoting=csv.QUOTE_NONNUMERIC, date_format='%Y-%m-%d %H:%M:%S', encoding='utf-8', line_terminator='\n')
df = pd.read_csv("PATH WHERE YOUR FILE IS SAVED/filename.csv", encoding='utf-8', lineterminator='\n', index_col=0)
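A minimal round trip of the same save/load pattern through an in-memory buffer (so no file paths are needed; the data frame is invented, and the version-sensitive line-terminator argument is left at its default):

```python
import csv
import io
import pandas as pd

# Hypothetical frame with a datetime column
df = pd.DataFrame({"postedTime": pd.to_datetime(["2018-06-01 09:00:00"]), "n": [3]})

buf = io.StringIO()
# Quote non-numeric fields and format datetimes exactly as in the snippet above
df.to_csv(buf, quoting=csv.QUOTE_NONNUMERIC, date_format='%Y-%m-%d %H:%M:%S')
buf.seek(0)

back = pd.read_csv(buf, index_col=0)
back["postedTime"] = pd.to_datetime(back["postedTime"])  # restore dtype after the round trip
```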
@sagarnanduunc
sagarnanduunc / topic-helper.R
Created March 13, 2018 20:01
R helper function for topic modeling
analyzeTopics <- function(ctmFit, fileLoc){
  td_beta <- tidy(ctmFit, matrix = "beta")
  # helper functions (from David Robinson's drlib R package)
  scale_x_reordered <- function(..., sep = "___") {
    reg <- paste0(sep, ".+$")
    ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
  }
  reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
    new_x <- paste(x, within, sep = sep)
    stats::reorder(new_x, by, FUN = fun)
  }