I saw this in the Nushell Discord channel, where someone was showing how they used group-by and histogram together.
- The data came from Retraction Watch.
- This describes the data.
The retraction_watch.csv "Subject" field can hold more than one "field of study", so the file must be split so that there is one record per "field of study". The split also fills the new field, area, with the full name of the area instead of just a three-character code.
- Before the split, there are 68,513 records.
- After the split, there are 186,442 records.
open d:/work/visidata/retraction_watch.csv
| get Subject
| split row ";"
| compact -e
#| parse --regex '^\((?P<area>\S+)\)\s(?P<subject>\w+(?:(?:\s\w+)+(?:\/\w+)?|\/\w+(?:\s\w+)?)?)?(?:\s\-\s?)?(?P<specialization>\w+(?:\s\w+)?(?:\/\w+)?)?'
| parse --regex r#'(?x)^\( (?<area>\S+) \)\s(?<subject>\w+(?:(?:\s\w+)+(?:\/\w+)?|\/\w+(?:\s\w+)?)?)?(?:\s\-\s?)?(?<specialization>\w+(?:\s\w+)?(?:\/\w+)?)?'#
| par-each {
    update area {
        str replace --all 'B/T' 'B/T: Business and Technology'
        | str replace --all 'BLS' 'BLS: Basic Life Sciences'
        | str replace --all 'ENV' 'ENV: Environmental Sciences'
        | str replace --all 'HSC' 'HSC: Health Sciences'
        | str replace --all 'HUM' 'HUM: Humanities'
        | str replace --all 'PHY' 'PHY: Physical Sciences'
        | str replace --all 'SOC' 'SOC: Social Sciences'
    }
}
| save -f d:/work/Visidata/retraction_watch_split.csv

open d:/work/Visidata/retraction_watch_split.csv
| group-by area --to-table
| par-each {update items { histogram subject } }
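
To try the group-by plus histogram step without the full CSV, here is a minimal sketch on a tiny inline table. The three rows are made up for illustration and are not taken from the Retraction Watch data. It assumes a Nushell version where group-by --to-table puts the grouped rows in an "items" column, as the pipeline above does, and it uses each instead of par-each only so the row order stays stable.

```nu
# Hypothetical sample rows, not real Retraction Watch records
[[area subject];
    ["HSC: Health Sciences" "Medicine"]
    ["HSC: Health Sciences" "Medicine"]
    ["BLS: Basic Life Sciences" "Biology"]]
| group-by area --to-table                    # one row per area, grouped rows under "items"
| each { update items { histogram subject } } # replace each group with a histogram of its subjects
```

Each area row should then carry its own small histogram table in items, which is the same shape the full pipeline produces per area.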