
@sagarnanduunc
Created March 28, 2018 04:54
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

tknz = TweetTokenizer()
stop = stopwords.words('english') + list(string.punctuation)

# Translation table that deletes all punctuation except '#', '@' and "'"
translator = str.maketrans('', '', string.punctuation.replace("#", "").replace("@", "").replace("'", ""))

def cleanTweet(text):
    text = re.sub(r"\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*", "", text).lower()  # remove URLs
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # remove non-ASCII characters (e.g. emojis)
    text = re.sub(r'[\r\n]', '', text)          # remove newline characters
    text = text.translate(translator)           # remove punctuation except "'", '#' and '@'
    tokens = tknz.tokenize(text)
    return ' '.join(t for t in tokens if t not in stop)

# Use this to clean the body column of a pandas DataFrame:
df["body"] = df["body"].apply(cleanTweet)
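As a quick standalone check, the first two cleaning steps (URL removal and selective punctuation stripping) can be exercised without NLTK or pandas; the sample tweet below is made up for illustration:

```python
import re
import string

# Same translation table as above: delete all punctuation except '#', '@' and "'"
translator = str.maketrans('', '', string.punctuation.replace("#", "").replace("@", "").replace("'", ""))

tweet = "Don't miss https://example.com/post #NLP, @friend!!"
tweet = re.sub(r"\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*", "", tweet).lower()  # drop the URL
tweet = tweet.translate(translator)  # drop ',' and '!' but keep "'", '#', '@'
print(tweet.split())
```

Note that hashtags and mentions survive the cleaning intact, which is why `'#'` and `'@'` are carved out of the translation table.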