@acstrahl
Created June 3, 2025 20:46
Analyzing Startup Fundraising Deals from Crunchbase: Dataset optimization and memory handling DEMO
@dlong10

dlong10 commented Jul 16, 2025

Hey Anna, thank you for your seminar on this! You mentioned, as some 'homework' at the end, that we should try to find a more precise data type for raised_amount_usd. Currently it's float64 - I'm guessing you meant that we should convert it to int64 since all the values in that column are whole numbers?

However, a datatype of int64 requires all values to be non-null. Unfortunately, raised_amount_usd does not satisfy this condition.

I'm having a hard time eliminating null values in an efficient way.

For example, let's say I run for chunk in chunk_iter...chunk = chunk.dropna(subset=['raised_amount_usd']) to remove null values.

If I run chunk_iter = pd.read_csv(...) in the next cell, as we repeatedly do, it restores all the null values!

Do I have to keep running for chunk in chunk_iter...chunk = chunk.dropna(subset=['raised_amount_usd']) for every cell?

One solution that I've found is to define a custom function that wraps pd.read_csv(...) and adds an extra step that drops rows where 'raised_amount_usd' is null. I can then call this function instead of pd.read_csv(...) in each cell.

Is there a better way? Also, let me know if there's a better communication channel to reach out!

Thanks,
Dominic
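
For readers following along, here is a minimal sketch of the wrapper idea described above. The nulls "come back" because dropna returns a new DataFrame and never touches the CSV on disk, so every fresh pd.read_csv(...) call re-reads the original rows. The function name read_clean_chunks, the sample data, and the chunk size are all hypothetical; the real Crunchbase file and the read_csv arguments used in the seminar are not shown in this thread.

```python
import io
import pandas as pd

# Hypothetical stand-in for the Crunchbase CSV (assumption: the real file
# has a raised_amount_usd column with some empty values).
SAMPLE_CSV = io.StringIO(
    "company_name,raised_amount_usd\n"
    "A,1000000\n"
    "B,\n"
    "C,2500000\n"
    "D,\n"
    "E,40000\n"
)

def read_clean_chunks(source, chunksize, **read_csv_kwargs):
    """Yield chunks from pd.read_csv with null raised_amount_usd rows dropped."""
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        # dropna returns a new DataFrame; the file itself is never modified,
        # which is why the cleaning has to happen on every read.
        yield chunk.dropna(subset=["raised_amount_usd"])

chunks = list(read_clean_chunks(SAMPLE_CSV, chunksize=2))
total_rows = sum(len(c) for c in chunks)
remaining_nulls = sum(int(c["raised_amount_usd"].isna().sum()) for c in chunks)
print(total_rows, remaining_nulls)
```

Writing it as a generator keeps only one chunk in memory at a time, which matches the chunked workflow in the question.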

@joshdisu

Hi Dominic,

I've recently been working on this as homework too! For my project I downcast raised_amount_usd to float32 (instead of converting to int64) because it handles NaN values, keeps any decimal part (if present), and uses half the memory of float64.

Hope that helps!
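
A small sketch of the float32 downcast described above, using made-up amounts rather than the real Crunchbase data. It checks the memory saving and the NaN handling, and also illustrates one caveat worth knowing: float32 has a 24-bit significand, so whole numbers above 2**24 (16,777,216) are not all exactly representable. Whether that matters depends on how large the real raised_amount_usd values are, which this thread does not state.

```python
import numpy as np
import pandas as pd

# Made-up amounts (assumption); pandas stores them as float64 by default.
amounts = pd.Series([1_000_000.0, np.nan, 16_777_217.0, 2_500_000.0])

as_f32 = amounts.astype("float32")

# Bytes used by the values only (index excluded).
bytes_f64 = amounts.memory_usage(index=False)
bytes_f32 = as_f32.memory_usage(index=False)

# NaN survives the cast, so no rows need to be dropped.
nan_preserved = bool(as_f32.isna().iloc[1])

# 16,777,217 is 2**24 + 1, which float32 cannot represent exactly.
stays_exact = float(as_f32.iloc[2]) == 16_777_217.0

print(bytes_f64, bytes_f32, nan_preserved, stays_exact)
```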

@acstrahl
Author

Hi Dominic! Thanks for reaching out with your question! Because it's such a great question and would benefit other Dataquesters, would you mind posting it in the Dataquest community and tagging me (@Anna_Strahl)? I'll chime in there :)

@dlong10

dlong10 commented Jul 19, 2025

Hi both, thank you for your replies!

@joshdisu, I've changed raised_amount_usd to float32 as per your advice. I didn't realise that float32 uses less memory than int64! I still wonder whether int64 is the more appropriate datatype, given that the homework question asks for a more precise datatype for raised_amount_usd and the column contains only whole numbers. But I guess if all we are concerned about is memory usage, then float32 is more appropriate?

@acstrahl, I've just posted my question in the Dataquest community, titled "Changing datatypes - Crunchbase Data Engineering" and have tagged you. Thank you for looking at this!
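
One option not raised in the thread so far, offered as a sketch rather than as the intended homework answer: pandas has a nullable integer dtype, spelled "Int64" with a capital I, that stores whole numbers alongside a missing-value mask. That sidesteps the non-null requirement of NumPy's int64, at the cost of more memory than float32. The sample amounts below are made up.

```python
import numpy as np
import pandas as pd

# Made-up amounts (assumption): whole numbers plus one missing value.
amounts = pd.Series([1_000_000.0, np.nan, 16_777_217.0, 2_500_000.0])

# "Int64" (capital I) is the pandas nullable integer dtype; NaN becomes pd.NA.
as_int = amounts.astype("Int64")
as_f32 = amounts.astype("float32")

missing_count = int(as_int.isna().sum())

# Unlike float32, the nullable integer keeps this value exactly.
exact_value = int(as_int.iloc[2])

# Compare bytes used by the values only (index excluded).
bytes_int = as_int.memory_usage(index=False)
bytes_f32 = as_f32.memory_usage(index=False)

print(as_int.dtype, missing_count, exact_value, bytes_int, bytes_f32)
```

So the trade-off looks roughly like this: float32 is the smallest, while Int64 keeps exact whole-dollar amounts and still allows missing values.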