- Don't use SELECT *. Specify explicit column names (columnar store).
- Avoid large JOINs (filter each table first).
- In Presto, tables are joined in the order they are listed.
- Join small tables earlier in the plan and leave larger fact tables to the end.
- Avoid cross joins or one-to-many joins, as these can degrade performance.
- ORDER BY and GROUP BY take time.
- Only use ORDER BY in subqueries if it is really necessary.
- When using GROUP BY, order the columns from the highest cardinality (that is, the greatest number of unique values) to the lowest.
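The GROUP BY tip can be applied mechanically. The following is a minimal sketch, not part of Presto: the helper `order_group_by_columns` and the sample rows are hypothetical, and it assumes that a small sample of rows is representative of each column's cardinality.

```python
from typing import Dict, List, Sequence


def order_group_by_columns(
    rows: Sequence[Dict[str, object]], columns: List[str]
) -> List[str]:
    """Return `columns` sorted from most to fewest distinct values in `rows`."""
    # Count the distinct values seen for each column in the sample.
    cardinality = {col: len({row[col] for row in rows}) for col in columns}
    # sorted() is stable, so columns with equal cardinality keep their input order.
    return sorted(columns, key=lambda col: cardinality[col], reverse=True)


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"country": "US", "user_id": 1, "device": "ios"},
    {"country": "US", "user_id": 2, "device": "android"},
    {"country": "DE", "user_id": 3, "device": "ios"},
    {"country": "US", "user_id": 4, "device": "web"},
]

ordered = order_group_by_columns(sample_rows, ["country", "user_id", "device"])
group_by_clause = "GROUP BY " + ", ".join(ordered)
print(group_by_clause)
```

In practice the cardinalities would more likely come from table statistics than from a row sample; the sorting step stays the same.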
Getting Started With Superset: Airbnb’s data exploration platform
At the time of writing, Python v3.5 and PIP v9.0.1 were available on AWS EC2.
sudo yum update -y
sudo yum install python35 -y
import pandas as pd
# Spark context
import pyspark
sc = pyspark.SparkContext()
# apply parallel
def applyParallel(dfGrouped, func):
    # rdd with the group of dataframes
import pandas as pd
def _map_to_pandas(rdds):
    """ Needs to be here due to pickling issues """
    return [pd.DataFrame(list(rdds))]
def toPandas(df, n_partitions=None):
    """
    Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
    repartitioned if `n_partitions` is passed.
DISTINCT columns -> Return only the unique rows of the result set.
Think of it this way: for each row in the projection, concatenate all the column values and return only the strings that are unique.
test_db=# SELECT DISTINCT parent_id, child_id, id FROM test.foo_table ORDER BY parent_id, child_id, id LIMIT 10;
parent_id | child_id | id
-----------+------------+-----------------------------
1000040 | 103 | 1000040|2645405726|0001|103
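The "unique combination of column values" idea can be sketched in Python. This is an illustration only: `distinct_rows` and the sample rows are hypothetical, and rows are compared as whole tuples rather than as concatenated strings, because naive concatenation can make two different rows look identical.

```python
from typing import Iterable, List, Tuple

Row = Tuple[object, ...]


def distinct_rows(rows: Iterable[Row]) -> List[Row]:
    """Drop duplicate rows, keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(rows))


# Hypothetical projected rows, for illustration only.
rows = [
    (1, "a", "x"),
    (1, "a", "x"),     # exact duplicate of the first row: removed
    (1, "a", "y"),     # differs in one column: kept
    ("ab", "c", "z"),
    ("a", "bc", "z"),  # distinct row, but concatenates to the same string as the row above
]

unique = distinct_rows(rows)
for row in unique:
    print(row)
```

Unlike this sketch, SQL makes no promise about the order of DISTINCT results unless an ORDER BY is given, as in the query above.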
import cv2.cv as cv
import tesseract
gray = cv.LoadImage('captcha.jpeg', cv.CV_LOAD_IMAGE_GRAYSCALE)
cv.Threshold(gray, gray, 231, 255, cv.CV_THRESH_BINARY)
api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz")
api.SetPageSegMode(tesseract.PSM_SINGLE_WORD)
tesseract.SetCvImage(gray,api)
print api.GetUTF8Text()
# United States Postal Service (USPS) abbreviations.
abbreviations = [
    # https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States#States.
    "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "IA",
    "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME", "MI", "MN", "MO",
    "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV", "NY", "OH", "OK",
    "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI",
    "WV", "WY",
    # https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States#Federal_district.
    "DC",
"""Example of adaptive Lasso to produce even sparser solutions
Adaptive lasso consists of computing many Lasso fits with feature
reweighting. It's also known as iterated L1.
"""
# Authors: Alexandre Gramfort <firstname.lastname@inria.fr>
#
# License: BSD (3-clause)
import numpy as np
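As a rough illustration of "many Lasso fits with feature reweighting", one common reweighted-L1 scheme can be written as follows. This is a sketch of a standard variant, not necessarily the exact update used in this script; the symbols $X$, $y$, $n$, $p$, $\alpha$, $\varepsilon$ and the per-feature weights $\lambda_j$ are notation introduced here.

```latex
\begin{aligned}
w^{(k)} &= \arg\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
          + \alpha \sum_{j=1}^{p} \lambda_j^{(k)} \lvert w_j \rvert, \\
\lambda_j^{(k+1)} &= \frac{1}{\lvert w_j^{(k)} \rvert + \varepsilon},
          \qquad \lambda_j^{(0)} = 1 .
\end{aligned}
```

Features whose coefficients shrink toward zero receive a larger penalty in the next fit, which pushes the solution toward greater sparsity than a single Lasso fit.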
# -o- coding: utf-8 -o-
# ISO3166 python dict
# official list in http://www.iso.org/iso/iso_3166_code_lists
ISO3166 = {
    'AF': 'AFGHANISTAN',
    'AX': 'ÅLAND ISLANDS',
    'AL': 'ALBANIA',
    'DZ': 'ALGERIA',
    'AS': 'AMERICAN SAMOA',