Jin Zhe jin-zhe
@jin-zhe
jin-zhe / scan_large_files.sh
Last active December 19, 2025 10:10
Finds large files under a given directory (optionally matching a given extension) with a configurable file-size limit (default 10 MB). Good for identifying files for Git LFS.
#!/usr/bin/env bash
set -euo pipefail
# --- Default Values ---
TARGET_DIR="."
FILE_EXT=""
SIZE_LIMIT_MB=10
OUTPUT_FILE="large_files_report.txt"
# --- Help Function ---
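The preview above stops before the scanning logic, so the following is only a minimal Python sketch of the same idea, not the script's actual implementation. The function name `find_large_files` is hypothetical; its parameters mirror the script's `TARGET_DIR`, `FILE_EXT` and `SIZE_LIMIT_MB` defaults, and it assumes that "large" means strictly greater than the limit.

```python
from pathlib import Path


def find_large_files(target_dir=".", file_ext="", size_limit_mb=10):
    """Return (path, size_in_bytes) pairs for files larger than the limit.

    Hypothetical helper: the parameters mirror the script's defaults, but the
    script's real scanning logic is not shown in the preview.
    """
    limit_bytes = size_limit_mb * 1024 * 1024
    # An empty extension matches every file; otherwise match "*.<ext>".
    pattern = f"*.{file_ext.lstrip('.')}" if file_ext else "*"
    hits = []
    for path in Path(target_dir).rglob(pattern):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                hits.append((path, size))
    # Largest files first, which is usually the order wanted for a Git LFS audit.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Note that this sketch also walks hidden directories such as `.git`; a real scan would probably want to skip those.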
@jin-zhe
jin-zhe / compress_pdf.md
Created June 26, 2025 06:45
PDF compression
@jin-zhe
jin-zhe / parallel_apply.py
Last active December 4, 2024 08:32
Pandas parallel apply function
'''
DESCRIPTION:
This simple convenience function provides parallelization of pandas .apply()
Adapted from: https://proinsias.github.io/tips/How-to-use-multiprocessing-with-pandas/
REQUIREMENTS:
`multiprocess` and `dill` packages are required.
```
python -m pip install multiprocess dill
```
@jin-zhe
jin-zhe / io.py
Created November 26, 2024 07:11
IO convenience functions for Python
def load_csv(csv_path: Path, ignore_first_row=True, ignore_empty_rows=True, delimiter=','):
    '''
    Returns all the rows of a csv file
    '''
    rows = []
    with csv_path.open() as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=delimiter)
        if ignore_first_row:
            next(csv_reader)
        for row in csv_reader:
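The preview is cut off inside the loop, so how `ignore_empty_rows` is handled is not visible. Below is a self-contained sketch of one way such a loader could work; the name `load_csv_sketch` and the rule that a row is "empty" when every cell is blank are assumptions, not the gist's code.

```python
import csv
from pathlib import Path


def load_csv_sketch(csv_path: Path, ignore_first_row=True,
                    ignore_empty_rows=True, delimiter=","):
    """Return the rows of a CSV file as lists of strings."""
    rows = []
    # newline="" is what the csv module documentation recommends for file objects.
    with csv_path.open(newline="") as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=delimiter)
        if ignore_first_row:
            # Passing a default avoids StopIteration on an empty file.
            next(csv_reader, None)
        for row in csv_reader:
            # Assumption: a row counts as empty when every cell is blank.
            if ignore_empty_rows and not any(cell.strip() for cell in row):
                continue
            rows.append(row)
    return rows
```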
@jin-zhe
jin-zhe / pandas_jsonl.py
Created November 24, 2024 11:15
Pandas convenience functions for reading and writing jsonl files.
import pandas as pd

def jsonl_to_df(jsonl_filepath):
    return pd.read_json(jsonl_filepath, lines=True)

def df_to_jsonl(df, jsonl_filepath):
    payload = df.to_json(orient='records', lines=True)
    with open(jsonl_filepath, 'w') as writer:
@jin-zhe
jin-zhe / split_pdf.py
Created November 22, 2024 15:58
Simple convenience script to split a PDF using PyPDF2 package in Python.
'''
Simple script to split a PDF using PyPDF2 package in Python.
We often need to split an academic paper into the main paper and the
supplementary material before submission.
To do that, the script may be simply run as:
`python split_pdf.py -in CVPR.pdf -s 15 -o`
This produces 2 files: 'CVPR.01-14.pdf' and 'CVPR.15-20.pdf', where the starting
page numbers for each split file are 1 and 15 respectively.
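The naming scheme in the example above can be sketched without any PDF library. This is only an illustration of how the page ranges and file names line up; `split_names` is a hypothetical helper, and the zero-padding rule (pad to the width of the last page number) is an assumption inferred from the two file names shown.

```python
from pathlib import Path


def split_names(pdf_name, total_pages, split_starts):
    """Return output file names for a PDF split at the given 1-based start pages."""
    stem = Path(pdf_name).stem
    suffix = Path(pdf_name).suffix
    # Page 1 always starts the first part; duplicates are dropped.
    starts = sorted({1, *split_starts})
    # Assumption: page numbers are zero-padded to the width of the last page.
    width = len(str(total_pages))
    names = []
    for i, start in enumerate(starts):
        # A part ends just before the next part starts, or at the last page.
        end = starts[i + 1] - 1 if i + 1 < len(starts) else total_pages
        names.append(f"{stem}.{start:0{width}d}-{end:0{width}d}{suffix}")
    return names


print(split_names("CVPR.pdf", 20, [15]))
# → ['CVPR.01-14.pdf', 'CVPR.15-20.pdf']
```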
@jin-zhe
jin-zhe / wandb_htmltable.py
Created September 13, 2024 08:54
HTML table for wandb that supports images
'''
Workaround for logging a simple table that supports step sliding. (See issue https://github.com/wandb/wandb/issues/6286)
Unfortunately, wandb does not currently support this with `wandb.Table`, which is overkill for such a simple table.
The `wandb_htmltable` function follows the same signature as `wandb.Table` and takes input parameters of the same types.
It currently supports only text and image data. Image data is embedded as a byte string declared in the <img /> tag.
Example:
```
my_data = [
'''
Resizes images in source image directory within given size bounds (keeping
aspect ratio) and outputs in target directory with identical directory tree
structure. Uses Magick for image resizing.
'''
import os
import argparse
import subprocess
from pathlib import Path
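The two ideas in the description, fitting an image within size bounds while keeping its aspect ratio and mirroring the source directory tree, can be sketched in plain Python. Both helper names are hypothetical, and the sketch assumes the bounds are maximums and that smaller images are never upscaled; the script's actual behaviour and its Magick invocation are not shown in the preview.

```python
from pathlib import Path


def fit_within(width, height, max_width, max_height):
    """Scale (width, height) down to fit the bounds, keeping aspect ratio."""
    # Assumption: never upscale, so the scale factor is capped at 1.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def mirrored_path(image_path, source_dir, target_dir):
    """Map a file under source_dir to the same relative path under target_dir."""
    return Path(target_dir) / Path(image_path).relative_to(source_dir)
```

For instance, a 4000x3000 image with bounds of 1000x1000 comes out as 1000x750, and `src/a/b.png` maps to `out/a/b.png`.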
@jin-zhe
jin-zhe / sshfr
Created May 10, 2023 08:07
Custom ssh command with port range forwarding
# STEP 1: `$ mkdir ~/bin`
# STEP 2: `$ touch ~/bin/sshfr`
# STEP 3: `$ chmod +x ~/bin/sshfr`
# STEP 4: Copy the following contents into `~/bin/sshfr`
# STEP 5: Update .profile or .bash_profile: `$ export PATH=$PATH":$HOME/bin"`
# STEP 6: Reload .profile or .bash_profile, e.g. `$ . ~/.bash_profile`
# The contents of sshfr are as follows
ADDRESS=$1
PORT_START=${2-49151}
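The rest of the script is not shown, so the following Python sketch only illustrates how a port range could be turned into OpenSSH `-L` local-forwarding options. The function name, the `port_end` parameter and the choice to forward each port to the same port on the remote `localhost` are all assumptions; only the address argument and the default start port 49151 come from the preview.

```python
def build_ssh_command(address, port_start=49151, port_end=None):
    """Build an ssh argument list that forwards a contiguous range of local ports."""
    # Assumption: with no end given, only the start port is forwarded.
    if port_end is None:
        port_end = port_start
    if not (0 < port_start <= port_end <= 65535):
        raise ValueError("expected 0 < port_start <= port_end <= 65535")
    command = ["ssh"]
    for port in range(port_start, port_end + 1):
        # -L local_port:host:host_port is OpenSSH's local forwarding option.
        command += ["-L", f"{port}:localhost:{port}"]
    command.append(address)
    return command
```

The resulting list can be passed to `subprocess.run` as-is, which avoids any shell quoting of the address.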
@jin-zhe
jin-zhe / edurec_scripts.py
Last active September 7, 2022 07:15
Script for EduRec exports
from datetime import datetime
import os
import pandas as pd
import argparse
'''
Note:
- Entries start on row 3 of EduRec excel exports
- 'Student Number' column is mandatory!