acstrahl/crunchbase.ipynb

Created June 3, 2025 20:46

Analyzing Startup Fundraising Deals from Crunchbase: Dataset optimization and memory handling DEMO

dlong10 commented Jul 16, 2025 (edited)

Hey Anna, thank you for your seminar on this! You mentioned, as some 'homework' at the end, that we should try to find a more precise data type for raised_amount_usd. Currently it's float64 - I'm guessing you meant that we should convert it to int64 since all the values in that column are whole numbers?

However, a datatype of int64 requires all values to be non-null. Unfortunately, raised_amount_usd does not satisfy this condition.

I'm having a hard time eliminating null values in an efficient way.

For example, let's say I loop over the chunks (for chunk in chunk_iter...) and run chunk = chunk.dropna(subset=['raised_amount_usd']) inside the loop to remove null values.

If I then run chunk_iter = pd.read_csv(...) in the next cell, as we repeatedly do, all the null values come back!

Do I have to repeat chunk = chunk.dropna(subset=['raised_amount_usd']) inside the loop in every cell?

One solution I've found is to define my own function that wraps pd.read_csv(...) and adds an extra step that drops null values in 'raised_amount_usd'. I can then call this function instead of pd.read_csv(...) in each cell.

Is there a better way? Also, let me know if there's a better communication channel for reaching out!

Thanks,
Dominic
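
The nulls reappear because dropna only changes the chunk held in memory; every new pd.read_csv(...) call reads the unchanged file from disk again. Below is a minimal sketch of the wrapper-function idea described above, written as a generator. The name iter_clean_chunks, the chunksize value, and the in-memory sample data (including the company_name column) are assumptions for illustration and do not come from the seminar notebook.

```python
import io

import pandas as pd


def iter_clean_chunks(source, chunksize=5000, **read_csv_kwargs):
    """Yield chunks of the CSV with rows missing raised_amount_usd removed.

    Hypothetical helper: it wraps pd.read_csv so the dropna step does not
    have to be repeated in every notebook cell.
    """
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        yield chunk.dropna(subset=["raised_amount_usd"])


# Tiny in-memory stand-in for the real file (made-up values).
SAMPLE_CSV = (
    "company_name,raised_amount_usd\n"
    "A,1000000\n"
    "B,\n"
    "C,2500000\n"
    "D,\n"
    "E,40000000\n"
)

total_rows = 0
for chunk in iter_clean_chunks(io.StringIO(SAMPLE_CSV), chunksize=2):
    total_rows += len(chunk)

print(total_rows)  # 3 rows remain after the two null rows are dropped
```

With the real file, the first argument would be the CSV path, and any other pd.read_csv options (encoding, dtype, usecols) can be passed through as keyword arguments.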

joshdisu commented Jul 18, 2025

Hi Dominic,

I've recently been working on this as homework too! For my project I downcast raised_amount_usd to float32 (instead of converting to int64) because it handles NaN values, keeps any potential decimal precision (if present), and uses half the memory of float64.

Hope that helps!
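
One caveat worth checking before settling on float32: it has a 24-bit significand, so whole numbers above 2**24 (16,777,216) are not all exactly representable, and funding amounts can easily exceed that. The sketch below uses made-up values, not rows from the Crunchbase file, to show both the memory saving and the rounding.

```python
import numpy as np
import pandas as pd

# Made-up amounts: one small, one just above 2**24, one missing.
amounts = pd.Series([1_000_000.0, 16_777_217.0, np.nan], dtype="float64")
as_f32 = amounts.astype("float32")

# True where the float32 value still equals the original (or both are NaN).
exact = (as_f32.astype("float64") == amounts) | amounts.isna()
print(exact.tolist())  # [True, False, True]

# float32 stores 4 bytes per value instead of 8.
print(amounts.memory_usage(index=False))  # 24
print(as_f32.memory_usage(index=False))   # 12
```

Running the same comparison on the real column would show whether any amounts in the dataset are actually affected.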

acstrahl (Author) commented Jul 18, 2025

Hi Dominic! Thanks for reaching out with your question! Because it's such a great question and would benefit other Dataquesters, would you mind posting it in the Dataquest community and tagging me (@Anna_Strahl)? I'll chime in there :)

dlong10 commented Jul 19, 2025

Hi both, thank you for your replies!

@joshdisu, I've changed raised_amount_usd to float32 as per your advice. I didn't realise that float32 uses less memory than int64! I wonder whether int64 is still the more appropriate datatype, given the homework question of finding a more precise datatype for raised_amount_usd, especially as raised_amount_usd contains only integer values. But I guess if all we are concerned about is memory usage, then float32 is more appropriate?

@acstrahl, I've just posted my question in the Dataquest community, titled "Changing datatypes - Crunchbase Data Engineering" and have tagged you. Thank you for looking at this!
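
On the int64 versus float32 question, one option not mentioned in the thread is pandas' nullable integer dtype, spelled "Int64" with a capital I. It stores whole numbers exactly and represents missing values as <NA>, so no rows need to be dropped. The trade-off is memory: it keeps a one-byte missing-value mask next to each 8-byte integer. This is a sketch on made-up data, not a statement of the intended homework answer.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real file (made-up values).
SAMPLE_CSV = (
    "company_name,raised_amount_usd\n"
    "A,1000000\n"
    "B,\n"
    "C,16777217\n"
)

# Ask read_csv for the nullable integer dtype directly.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype={"raised_amount_usd": "Int64"})
col = df["raised_amount_usd"]

print(col.dtype)         # Int64
print(col.isna().sum())  # 1 missing value, kept as <NA>
print(col.iloc[2])       # 16777217, stored exactly

# Compare bytes used by each candidate dtype for the same column.
for name in ("float64", "float32", "Int64"):
    print(name, col.astype(name).memory_usage(index=False))
```

The same dtype mapping can be passed to pd.read_csv together with chunksize, so each chunk arrives already typed.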
