Hey Anna, thank you for your seminar on this! You mentioned, as some 'homework' at the end, that we should try to find a more precise data type for `raised_amount_usd`. Currently it's `float64`. I'm guessing you meant that we should convert it to `int64`, since all the values in that column are whole numbers?
However, a datatype of `int64` requires all values to be non-null. Unfortunately, `raised_amount_usd` does not satisfy this condition.
I'm having a hard time eliminating null values in an efficient way.
For example, let's say I run `for chunk in chunk_iter...chunk = chunk.dropna(subset=['raised_amount_usd'])` to remove null values.
If I run `chunk_iter = pd.read_csv(...)` in the next cell, as we repeatedly do, it restores all the null values!
Do I have to keep running `for chunk in chunk_iter...chunk = chunk.dropna(subset=['raised_amount_usd'])` for every cell?
One solution that I've found is to define a custom function that wraps `pd.read_csv(...)` and adds an extra step that drops null values for `'raised_amount_usd'`. I can then run this function instead of `pd.read_csv(...)` in each cell.
Is there a better way? Also, let me know if there's a better communication channel to reach out!
Thanks, Dominic
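
For readers following along, here is a minimal sketch of the wrapper idea described above. It is not the seminar's code: the helper name `read_clean_chunks`, the sample CSV text, and the `company_name` column are all made up for illustration, and only `raised_amount_usd` comes from the thread. The nulls seem to "come back" because every call to `pd.read_csv` re-reads the file, so the drop has to happen each time a new iterator is built.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Crunchbase CSV. Only the column name
# raised_amount_usd comes from the thread; the rows are invented.
CSV_TEXT = """company_name,raised_amount_usd
alpha,1000000
beta,
gamma,250000
delta,
epsilon,75000
"""


def read_clean_chunks(source, chunksize, subset=("raised_amount_usd",)):
    """Yield chunks from pd.read_csv, dropping rows with nulls in `subset`."""
    for chunk in pd.read_csv(source, chunksize=chunksize):
        yield chunk.dropna(subset=list(subset))


# Each notebook cell would build a fresh iterator through the helper,
# in place of calling pd.read_csv directly.
total_rows = 0
for chunk in read_clean_chunks(io.StringIO(CSV_TEXT), chunksize=2):
    # With the nulls gone, the plain int64 conversion succeeds.
    chunk = chunk.astype({"raised_amount_usd": "int64"})
    total_rows += len(chunk)

print(total_rows)
```

Writing the helper as a generator keeps the chunked, low-memory behaviour, since only one chunk is held in memory at a time.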
Hi Dominic! Thanks for reaching out with your question! Because it's such a great question and would benefit other Dataquesters, would you mind posting it in the Dataquest community and tagging me (@Anna_Strahl)? I'll chime in there :)
Hi both, thank you for your replies!
@joshdisu, I've changed `raised_amount_usd` to `float32` as per your advice. I didn't realise that `float32` saves more memory than `int64`! I wonder whether `int64` is still a more appropriate datatype, given the homework question of finding a more precise datatype for `raised_amount_usd`. Especially as `raised_amount_usd` is all integer values... But, I guess if all we are concerned about is memory usage, then `float32` is more appropriate?
@acstrahl, I've just posted my question in the Dataquest community, titled "Changing datatypes - Crunchbase Data Engineering" and have tagged you. Thank you for looking at this!
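
One way to weigh the memory question above is to measure it on a small sample. The sketch below uses invented values and also tries pandas' nullable `Int64` dtype (capital "I"), which nobody in the thread mentions. It is included here only as an assumption about what might suit an all-integer column that contains missing values.

```python
import numpy as np
import pandas as pd

# Invented sample values; only the column name comes from the thread.
amounts = pd.Series(
    [1_000_000.0, np.nan, 250_000.0, np.nan, 75_000.0],
    name="raised_amount_usd",
)

as_float32 = amounts.astype("float32")
# Capital-I "Int64" is pandas' nullable integer dtype, which stores
# missing values as <NA> instead of requiring them to be dropped.
as_nullable_int = amounts.astype("Int64")

for label, series in [
    ("float64", amounts),
    ("float32", as_float32),
    ("Int64", as_nullable_int),
]:
    n_bytes = series.memory_usage(index=False, deep=True)
    print(label, n_bytes, "bytes,", int(series.isna().sum()), "missing")
```

On this sample `float32` takes half the bytes of `float64`, while the nullable integer column keeps whole-number semantics and its missing values at a higher memory cost than `float32`.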
Hi Dominic,
I've recently been working on this as homework too! For my project I downcast `raised_amount_usd` to `float32` (instead of converting to `int64`) because it handles NaN values, still allows fractional values (if present), and uses half the memory of `float64`.
Hope that helps!
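
One caveat about the `float32` approach may be worth checking against the actual data: `float32` has a 24-bit significand, so it cannot represent every whole number above 2**24 (16,777,216) exactly. The amount below is chosen purely to illustrate that limit and is not taken from the Crunchbase dataset.

```python
import numpy as np

# float32 has a 24-bit significand, so whole numbers are only guaranteed
# to be exact up to 2**24.
exact_limit = 2 ** 24  # 16,777,216

# Illustrative amount just past the limit (not from the dataset).
amount = exact_limit + 1

as_f32 = np.float32(amount)
as_f64 = np.float64(amount)

print(int(as_f32))  # 16777216: the trailing 1 is rounded away
print(int(as_f64))  # 16777217: float64 still holds the value exactly
```

If the column holds funding rounds larger than about 16.7 million, comparing the `float32` values against the originals would show whether any amounts were rounded.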