How to Fix Python Memory Crashes When Processing Large Files
Python scripts crash with MemoryError or get killed by the operating system (OOM killer) when they attempt to load entire large files into RAM using methods like .read() or .readlines(). To prevent this, you must stream or chunk the file so that only a small portion resides in memory at any given time.
1. Stream Text Files Line-by-Line
If you are processing text files (like logs or JSON Lines), do not use f.read(). Instead, iterate directly over the file object. A file object is its own iterator: Python reads through an internal buffer and hands you one line at a time, so memory usage depends on the length of the longest line rather than on the size of the file.
# Bad: Loads the entire file into RAM
# data = open('huge_log.txt').readlines()
# Good: Memory-efficient line-by-line streaming
with open('huge_log.txt', 'r', encoding='utf-8') as file:
    for line in file:
        # Process the line here
        if "ERROR" in line:
            print(line.strip())
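The same pattern extends to JSON Lines files, where each line holds one JSON document. The following is a minimal sketch, assuming records are JSON objects with a "level" field; the field name and both function names are illustrative and not taken from any particular dataset.

```python
import json

def iter_json_records(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line:  # Skip blank lines
                yield json.loads(line)

def count_errors(path):
    """Count records whose (assumed) 'level' field equals 'ERROR'."""
    return sum(1 for record in iter_json_records(path)
               if record.get('level') == 'ERROR')
```

Only one parsed record is alive at a time, so the footprint is set by the largest single record rather than by the file.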
2. Read Binary Files in Fixed-Size Chunks
For binary files (like videos, PDFs, or raw database dumps), iterate through the file in fixed-size blocks. You can define a generator function that yields chunks of a specified byte size (e.g., 64KB or 1MB).
def read_in_chunks(file_path, chunk_size=1024*1024):
    """Generator to read a file piece by piece (default: 1MB chunks)."""
    with open(file_path, 'rb') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            yield chunk
# Usage
for chunk in read_in_chunks('large_video.mp4'):
    # Process the binary chunk (e.g., hash calculation, network upload)
    pass
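As one concrete use of chunked reading, the sketch below computes a SHA-256 digest without ever holding the whole file in memory. The function name sha256_of_file is hypothetical; hashlib and the two-argument form of iter() are standard library features.

```python
import hashlib

def sha256_of_file(file_path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(file_path, 'rb') as file:
        # iter() with a sentinel calls file.read(chunk_size) until it returns b''
        for chunk in iter(lambda: file.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the hash object is updated incrementally, the result is identical to hashing the full contents in one call.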
3. Chunk Large CSVs and DataFrames with Pandas
Loading a massive CSV into a Pandas DataFrame via pd.read_csv() can consume several times the file's on-disk size in RAM (5x to 10x is a commonly cited range), because of data type inference and the per-object overhead of values such as strings. Pass the chunksize parameter to get an iterable TextFileReader object instead of a single DataFrame. Each iteration yields a DataFrame of at most chunksize rows, so you can process, filter, and write out data sequentially without overloading your system's memory.
import pandas as pd
csv_file = 'massive_dataset.csv'
chunk_size = 50000  # Number of rows per chunk
# Process the file in chunks of 50,000 rows
for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunk_size)):
    # 'chunk' is a standard Pandas DataFrame
    filtered_chunk = chunk[chunk['status'] == 'active']
    # Start a fresh file with a header for the first chunk, then append the rest
    filtered_chunk.to_csv('processed_data.csv', mode='w' if i == 0 else 'a',
                          header=(i == 0), index=False)
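Chunking also works for aggregations, as long as the running result is small. The sketch below tallies the values of the 'status' column across chunks; count_by_status is a hypothetical helper, and usecols is used on the assumption that only that one column is needed, which reduces the memory each chunk takes.

```python
from collections import Counter

import pandas as pd

def count_by_status(csv_source, chunk_size=50000):
    """Tally values of the 'status' column without loading the whole CSV."""
    totals = Counter()
    # usecols limits parsing to the one column this tally needs
    for chunk in pd.read_csv(csv_source, usecols=['status'], chunksize=chunk_size):
        # value_counts() skips missing values by default
        totals.update(chunk['status'].value_counts().to_dict())
    return dict(totals)
```

csv_source can be a file path or any file-like object that pd.read_csv accepts.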
4. Use Generator Expressions to Pipeline Processing
To transform data without saving intermediate lists in memory, chain generators together. This creates a pipeline where data flows one item at a time.
# Pipeline: Read -> Clean -> Filter
with open('data.txt', encoding='utf-8') as file:  # Ensures the file is closed afterwards
    lines = (line.strip() for line in file)
    clean_lines = (line.lower() for line in lines if line)
    matching_lines = (line for line in clean_lines if 'target' in line)
    for line in matching_lines:
        print(line)  # Memory footprint remains minimal throughout
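When a stage needs more than a one-line expression, the same pipeline can be written with generator functions. This is a sketch with hypothetical stage names; each stage accepts any iterable of strings, including an open file object.

```python
def strip_lines(lines):
    """Remove surrounding whitespace and drop empty lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield line

def lowercase(lines):
    """Lowercase each line."""
    for line in lines:
        yield line.lower()

def containing(lines, keyword):
    """Keep only lines that contain the keyword."""
    for line in lines:
        if keyword in line:
            yield line

def matching_lines(lines, keyword='target'):
    """Chain the stages; nothing runs until the result is iterated."""
    return containing(lowercase(strip_lines(lines)), keyword)
```

Each stage pulls one item from the stage before it, so no intermediate list is ever built.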
Troubleshooting: How to Verify Memory Usage
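If your script still crashes, measure its memory footprint with tracemalloc, a module in the standard library. tracemalloc.get_traced_memory() returns the current and peak size of memory allocated by Python since tracing started, which tells you whether your chunking is actually keeping the peak low. To find which line of code allocates the most, take a snapshot with tracemalloc.take_snapshot() and group its statistics by 'lineno'.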
import tracemalloc
tracemalloc.start()
# Run your chunking code here
current, peak = tracemalloc.get_traced_memory()
print(f"Peak memory usage: {peak / 10**6:.2f} MB")
tracemalloc.stop()
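To see which source lines hold the most memory, a snapshot can be grouped by line number. The sketch below wraps that in a hypothetical helper, top_allocations; build_list is a stand-in workload, not part of tracemalloc. Note that a snapshot only reports memory still allocated at the moment it is taken.

```python
import tracemalloc

def top_allocations(func, limit=5):
    """Run func() under tracemalloc and return (result, top allocation stats)."""
    tracemalloc.start()
    try:
        result = func()
        # Take the snapshot while tracing is active and the result is still alive
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # Group live allocations by source line, largest first
    stats = snapshot.statistics('lineno')[:limit]
    return result, stats

def build_list():
    # Hypothetical workload standing in for your file-processing code
    return [str(i) * 10 for i in range(100_000)]

if __name__ == '__main__':
    _, stats = top_allocations(build_list)
    for stat in stats:
        print(stat)  # Shows file, line number, total size, and block count
```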
Summary of Best Practices
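- Never call .read() without a size argument, or .readlines(), on files of unknown or large size.
- Match chunk sizes to your RAM: 1 MB to 8 MB chunks are a reasonable starting point for binary files, but the best size depends on the workload, so measure before tuning.
- Use garbage collection sparingly: CPython frees most objects as soon as nothing references them, so first make sure large objects are not still held in lists or variables between files. If objects are stuck in reference cycles, call gc.collect() (after import gc) to reclaim them.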