@ian-whitestone
ian-whitestone / notes.md
Last active March 1, 2023 01:45
Best practices for presto sql

Presto Specific

  • Don’t use SELECT *; specify explicit column names (with a columnar store, only the referenced columns are read)
  • Avoid large JOINs (filter each table first)
    • In Presto, tables are joined in the order they are listed!
    • Join small tables earlier in the plan and leave larger fact tables to the end
    • Avoid cross joins or one-to-many joins, as these can degrade performance
  • ORDER BY and GROUP BY take time
    • Only use ORDER BY in subqueries if it is really necessary
  • When using GROUP BY, order the columns from the highest cardinality (that is, the largest number of unique values) to the lowest.
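
The points above can be sketched with a small runnable example. This is only an illustration under stated assumptions: SQLite (from the Python standard library) stands in for Presto so the query can be executed locally, and the `users` and `orders` tables, their columns, and the data are all hypothetical. The table order in the join simply follows the notes; whether it helps in practice depends on the engine version and optimizer settings.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
# SQLite is used as a stand-in so the query runs locally; it is not Presto.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (user_id INTEGER, country TEXT, is_active INTEGER);
CREATE TABLE orders (order_id INTEGER, user_id INTEGER, status TEXT, order_date TEXT);
INSERT INTO users VALUES (1, 'CA', 1), (2, 'US', 1), (3, 'US', 0);
INSERT INTO orders VALUES
  (10, 1, 'paid',     '2023-01-05'),
  (11, 1, 'paid',     '2022-12-30'),
  (12, 2, 'refunded', '2023-02-01'),
  (13, 2, 'paid',     '2023-02-02'),
  (14, 3, 'paid',     '2023-02-03');
""")

QUERY = """
SELECT o.user_id, u.country, COUNT(o.order_id) AS n_orders  -- explicit columns, no SELECT *
FROM (
    -- smaller table listed first, filtered before the join
    SELECT user_id, country FROM users WHERE is_active = 1
) AS u
JOIN (
    -- larger table filtered before the join
    SELECT order_id, user_id FROM orders WHERE order_date >= '2023-01-01'
) AS o ON o.user_id = u.user_id
GROUP BY o.user_id, u.country  -- higher-cardinality column first
ORDER BY o.user_id             -- ORDER BY only at the outermost level
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```
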
@codspire
codspire / getting-started-with-superset-airbnb-data-exploration-platform.md
Last active February 12, 2024 21:41
Getting Started With Superset: Airbnb’s data exploration platform

Update Python and PIP versions on EC2 (Amazon AMI)

At the time of writing, Python v3.5 and PIP v9.0.1 were available on AWS EC2.

sudo yum update -y
sudo yum install python35 -y
@chumo
chumo / parallel_groupby_apply.py
Created June 17, 2016 13:14
Parallelize apply after pandas groupby using PySpark
import pandas as pd
# Spark context
import pyspark
sc = pyspark.SparkContext()
# apply parallel
def applyParallel(dfGrouped, func):
    # rdd with the group of dataframes
@joshlk
joshlk / faster_toPandas.py
Last active September 19, 2025 16:11
PySpark faster toPandas using mapPartitions
import pandas as pd
def _map_to_pandas(rdds):
    """ Needs to be here due to pickling issues """
    return [pd.DataFrame(list(rdds))]

def toPandas(df, n_partitions=None):
    """
    Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
    repartitioned if `n_partitions` is passed.
@jmindek
jmindek / gist:62c50dd766556b7b16d6
Last active January 31, 2024 15:48
DISTINCT ON like functionality for Redshift

DISTINCT column list -> Return only the unique rows of the result set. Think of it as concatenating all the column values of each row in the projection and returning only the strings that are unique.

test_db=# SELECT DISTINCT parent_id, child_id, id FROM test.foo_table ORDER BY parent_id, child_id, id LIMIT 10;
 parent_id |  child_id  |             id
-----------+------------+-----------------------------
   1000040 |        103 | 1000040|2645405726|0001|103
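
To make the difference concrete, here is a minimal Python sketch of the two behaviours. The rows are made-up values, not taken from `test.foo_table`, and the second part mimics PostgreSQL-style `DISTINCT ON (parent_id)` semantics (keep the first row per key in sort order); it is not necessarily the approach this gist uses for Redshift.

```python
# Hypothetical rows of (parent_id, child_id, id), for illustration only.
rows = [
    (1, 10, "a"),
    (1, 10, "a"),  # exact duplicate of the previous row
    (1, 11, "b"),
    (2, 20, "c"),
]

# SELECT DISTINCT parent_id, child_id, id ... ORDER BY parent_id, child_id, id
# Whole rows are compared, so only the exact duplicate disappears.
distinct_rows = sorted(set(rows))

# DISTINCT ON (parent_id)-like behaviour: keep the first row for each
# parent_id after sorting, and drop the remaining rows of that group.
first_per_parent = {}
for row in sorted(rows):
    first_per_parent.setdefault(row[0], row)
distinct_on_parent = list(first_per_parent.values())

print(distinct_rows)
print(distinct_on_parent)
```
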
@christianroman
christianroman / test.py
Created May 30, 2013 16:02
Bypass Captcha using 10 lines of code with Python, OpenCV & Tesseract OCR engine
import cv2.cv as cv
import tesseract
gray = cv.LoadImage('captcha.jpeg', cv.CV_LOAD_IMAGE_GRAYSCALE)
cv.Threshold(gray, gray, 231, 255, cv.CV_THRESH_BINARY)
api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz")
api.SetPageSegMode(tesseract.PSM_SINGLE_WORD)
tesseract.SetCvImage(gray,api)
print api.GetUTF8Text()
@JeffPaine
JeffPaine / us_state_abbreviations.py
Last active November 27, 2025 20:16
A python list of all US state abbreviations.
# United States Postal Service (USPS) abbreviations.
abbreviations = [
    # https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States#States.
    "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "IA",
    "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME", "MI", "MN", "MO",
    "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV", "NY", "OH", "OK",
    "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI",
    "WV", "WY",
    # https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States#Federal_district.
    "DC",
@agramfort
agramfort / demo_adaptive_lasso.py
Created January 14, 2012 10:35
Adaptive Lasso demo
"""Example of adaptive Lasso to produce even sparser solutions

Adaptive Lasso consists of computing many Lasso fits with feature
reweighting. It's also known as iterated L1.
"""
# Authors: Alexandre Gramfort <firstname.lastname@inria.fr>
#
# License: BSD (3-clause)
import numpy as np
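
The preview above is cut off, so the following is a separate, minimal sketch of the reweighting idea described in the docstring, not the gist's own code. It assumes a plain coordinate-descent Lasso written with NumPy only and the common penalty weights `1 / (|coef| + eps)`; the function names, `eps`, and the synthetic data are all hypothetical choices.

```python
import numpy as np


def soft_threshold(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, alpha, n_iter=100):
    """Minimize (1 / (2n)) * ||y - X b||^2 + alpha * ||b||_1 by coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()  # residual y - X b (b starts at zero)
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]           # remove feature j from the fit
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, alpha) / col_sq[j]
            r -= X[:, j] * b[j]           # add it back with the new value
    return b


def adaptive_lasso(X, y, alpha, n_reweight=5, eps=1e-3):
    """Iterated L1: repeat Lasso fits, penalizing small coefficients more each time."""
    penalty = np.ones(X.shape[1])         # first pass is an ordinary Lasso
    for _ in range(n_reweight):
        # A per-feature penalty is equivalent to rescaling the columns.
        b = lasso_cd(X / penalty, y, alpha) / penalty
        penalty = 1.0 / (np.abs(b) + eps)
    return b


# Synthetic data: only the first two of ten features are active.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
true_coef = np.zeros(10)
true_coef[:2] = [3.0, -2.0]
y = X @ true_coef + 0.1 * rng.standard_normal(100)

coef = adaptive_lasso(X, y, alpha=0.1)
print(coef)
```

Rescaling the columns lets an unweighted Lasso solver handle per-feature penalties, which is why each pass divides `X` by `penalty` and then divides the fitted coefficients by `penalty` again.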
@carlopires
carlopires / ISO3166.py
Created October 4, 2011 15:33
Python dict for ISO3166 country codes
# -*- coding: utf-8 -*-
# ISO3166 python dict
# official list at http://www.iso.org/iso/iso_3166_code_lists
ISO3166 = {
    'AF': 'AFGHANISTAN',
    'AX': 'ÅLAND ISLANDS',
    'AL': 'ALBANIA',
    'DZ': 'ALGERIA',
    'AS': 'AMERICAN SAMOA',