Created
December 23, 2025 13:06
`$WACZ` is your input file. The core of the procedure is that a WACZ embeds a separate WARC file containing the screenshots. Inside that WARC file the screenshots are simply stored as ordinary WARC records, which you can extract with the `warcio` program.
#!/bin/fish
set tmpdir (mktemp -d)
# 'archive/*' is given as an entry point to extract as little as possible
unzip -qq $WACZ 'archive/*' -d $tmpdir
# find the screenshot WARC within the WACZ
set screenshot_warc (fd -tfile screenshots $tmpdir/archive/ | head -1)
# calculate the location of the screenshot in the WARC:
# read the record at offset 0, then skip past its headers, its content,
# and the 4 bytes (two CRLFs) that terminate a WARC record
set content_length (warcio extract --header "$screenshot_warc" 0 | rg 'Content-Length: (\d+)' -r '$1')
set header_length (warcio extract --header "$screenshot_warc" 0 | wc -c)
set screenshot_offset (math $content_length + $header_length + 4)
# WARCs are often gzip'ed; the offset above refers to the uncompressed file
gunzip $screenshot_warc
warcio extract --payload (echo "$screenshot_warc" | sd -F '.warc.gz' '.warc') $screenshot_offset > $WACZ.png
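
The offset arithmetic in the script can be illustrated with a small, self-contained Python sketch. It does not use `warcio` and does not read a real WACZ. It builds a toy, uncompressed two-record WARC in memory and computes where the second record starts. `make_record` and `next_record_offset` are hypothetical helpers written for this illustration. The toy records leave out fields that a real WARC requires, such as `WARC-Record-ID` and `WARC-Date`. The sketch assumes, as the script appears to, that the header length includes the blank line that ends the header block and that the screenshot is the record directly after the first one.

```python
CRLF = b"\r\n"


def make_record(rec_type: str, payload: bytes) -> bytes:
    """Build a minimal, uncompressed WARC-like record (toy helper for this demo)."""
    header = (
        b"WARC/1.1" + CRLF
        + b"WARC-Type: " + rec_type.encode() + CRLF
        + b"Content-Length: " + str(len(payload)).encode() + CRLF
        + CRLF  # blank line that ends the header block
    )
    # A record's content block is followed by two CRLFs (4 bytes).
    return header + payload + CRLF + CRLF


def next_record_offset(warc: bytes, offset: int = 0) -> int:
    """Return the offset of the record that follows the one starting at `offset`."""
    # The header block ends at the first blank line; +4 includes that blank line.
    header_end = warc.index(CRLF + CRLF, offset) + 4
    header = warc[offset:header_end]

    content_length = None
    for line in header.split(CRLF):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            content_length = int(value.strip())
    if content_length is None:
        raise ValueError("record has no Content-Length header")

    header_length = header_end - offset
    # Same sum as in the script: headers + content + the two trailing CRLFs.
    return offset + header_length + content_length + 4


first = make_record("warcinfo", b"software: demo")
second = make_record("resource", b"fake-png-bytes")
warc = first + second

screenshot_offset = next_record_offset(warc)
print(screenshot_offset, warc[screenshot_offset:screenshot_offset + 8])
```

The computed offset lands exactly on the `WARC/1.1` line of the second record, which is why the script can pass it straight to `warcio extract --payload`. If the screenshot WARC holds more than one record before the screenshot, the same step would have to be repeated from each new offset.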