Deployment issues
Worker fails to start
If your worker fails to start or initialize:
- Check logs: View endpoint logs in the Runpod console for error messages.
- Verify local testing: Ensure your handler works in local testing before deploying.
- Check dependencies: Verify all dependencies are installed in your Docker image.
- GPU compatibility: Ensure your Docker image is compatible with the selected GPU type.
- Input format: Verify your input format matches what your handler expects.
Worker initializes but fails on requests
| Issue | Solution |
|---|---|
| Input validation errors | Add input validation in your handler and check logs for the expected format |
| Missing dependencies | Verify all required packages are in your Dockerfile |
| Model loading failures | Check GPU memory requirements and model path |
| Permission errors | Ensure files are readable and directories are writable |
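As an illustration of the input validation row above, here is a minimal sketch of a handler that checks its input before doing any work. The field names (prompt, max_tokens), the REQUIRED_FIELDS schema, and the shape of the returned error are assumptions made for this example, not part of the Runpod API. In a real worker the handler would typically be registered with the Runpod SDK (for example via runpod.serverless.start); that call is left out so the snippet runs on its own.

```python
from typing import Any

# Hypothetical schema for illustration: field name -> required Python type.
REQUIRED_FIELDS = {"prompt": str, "max_tokens": int}


def validate_input(job_input: Any) -> list[str]:
    """Return a list of problems; an empty list means the input is valid."""
    if not isinstance(job_input, dict):
        return ["input must be a JSON object"]
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in job_input:
            problems.append(f"missing field: {name}")
        elif not isinstance(job_input[name], expected):
            problems.append(f"field {name} must be of type {expected.__name__}")
    return problems


def handler(job: dict) -> dict:
    """Reject malformed input early and log what was expected."""
    problems = validate_input(job.get("input"))
    if problems:
        # Printing to stdout makes the expected format visible in endpoint logs.
        print(f"Rejected input: {problems}; expected fields: {sorted(REQUIRED_FIELDS)}")
        return {"error": "; ".join(problems)}
    return {"echo": job["input"]["prompt"]}
```

Validating first keeps malformed requests from reaching model code, where the resulting failure is usually harder to read in the logs.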
Job issues
Jobs stuck in queue
If jobs remain IN_QUEUE for extended periods:
- No workers available: Check if max_workers is set appropriately.
- Workers throttled: Your endpoint may be hitting rate limits. Check the Workers tab for throttled workers.
- Cold start delays: First requests after idle periods require worker initialization. Consider increasing min_workers or enabling FlashBoot.
Jobs timing out
| Cause | Solution |
|---|---|
| Processing takes too long | Increase executionTimeout in your job policy |
| Model loading too slow | Use model caching or bake models into your image |
| TTL too short | Set ttl to cover both queue time and execution time |
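The relationship between the two timeout settings in the table can be sketched as a small helper that builds a request body. This assumes the policy is sent under a "policy" key next to "input" and that both values are in milliseconds; check the job policy reference for your endpoint before relying on either assumption. The function name and its parameters are hypothetical.

```python
def build_request(job_input: dict, execution_timeout_ms: int, expected_queue_ms: int) -> dict:
    """Build a request body whose ttl covers queue time plus execution time."""
    # ttl must not expire while the job is still waiting or running.
    ttl_ms = expected_queue_ms + execution_timeout_ms
    return {
        "input": job_input,
        "policy": {
            "executionTimeout": execution_timeout_ms,  # assumed unit: milliseconds
            "ttl": ttl_ms,                             # assumed unit: milliseconds
        },
    }


# Example: allow 10 minutes of execution and up to 5 minutes in the queue.
payload = build_request({"prompt": "hello"}, 600_000, 300_000)
```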
Jobs failing
Check the job status response for error details. Common causes:
- Handler exceptions: Unhandled exceptions in your handler code. Add try/except blocks and return structured errors.
- OOM (Out of Memory): Model or batch size exceeds GPU memory. Reduce batch size or use a larger GPU.
- Timeout: Job exceeded execution timeout. Increase timeout or optimize processing.
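For the handler exceptions case, one possible pattern is to wrap the workload in a single try/except and return a structured error instead of letting the exception escape. The process function is a hypothetical placeholder, and the "error" key with type and message fields is an assumed shape for this sketch, not a documented contract.

```python
import traceback


def process(job_input: dict) -> dict:
    """Hypothetical workload; replace with real inference code."""
    return {"length": len(job_input["text"])}


def handler(job: dict) -> dict:
    """Run the workload and convert any exception into a structured error."""
    try:
        return process(job["input"])
    except Exception as exc:
        # Write the full traceback to stderr so it appears in the worker logs.
        traceback.print_exc()
        return {"error": {"type": type(exc).__name__, "message": str(exc)}}
```

Returning the exception type alongside the message makes it easier to tell input problems (such as a KeyError) apart from resource problems when reading the job status response.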
Endpoint scaling issues
My endpoint was scaled down unexpectedly
If your endpoint’s max workers dropped without any change on your end, Runpod scaled the endpoint down automatically. This happens in two situations:
- Prolonged inactivity: When an endpoint receives no requests for 3 days, its max workers is reduced to 2, and after 7 days its max workers is set to 0. Runpod emails you when the first reduction happens. For more details, see idle endpoint scale-down.
- Repeated unhealthy workers: When an endpoint consistently produces unhealthy (crashing) workers, Runpod scales it down to stop billing and reduce thrashing, and sends you an email.
Cold start issues
Slow cold starts
Cold start time includes container startup, model loading, and initialization. To reduce cold starts:
- Use model caching: Store models on network volumes instead of downloading on each start.
- Enable FlashBoot: Use FlashBoot for faster container initialization.
- Optimize image size: Use smaller base images and remove unnecessary dependencies.
- Initialize outside handler: Load models at module level, not inside the handler function.
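The last point, initializing outside the handler, can be sketched as follows. The load_model function is a hypothetical stand-in for an expensive load (a real worker would read weights from disk or a network volume), and LOAD_COUNT exists only to make the one-time load observable in this example.

```python
import time

LOAD_COUNT = 0  # only used to show that the load happens once


def load_model():
    """Hypothetical stand-in for an expensive model load."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.01)  # simulate slow initialization
    return lambda text: text.upper()


# Module level: runs once when the worker process imports this file,
# so every later request reuses the already loaded model.
MODEL = load_model()


def handler(job: dict) -> dict:
    """Per-request work only; no model loading happens here."""
    return {"output": MODEL(job["input"]["text"])}
```

If the load were inside handler instead, every request would pay the initialization cost, not just the first one after a cold start.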
Too many cold starts
If you’re seeing frequent cold starts:
- Increase idle timeout: Set a longer idle_timeout to keep workers warm between requests.
- Set minimum workers: Configure min_workers > 0 to maintain warm workers.
- Check traffic patterns: Sporadic traffic causes more cold starts than steady traffic.
Logging issues
Missing logs
If logs aren’t appearing in the console:
- Check throttling: Excessive logging triggers throttling. Reduce log verbosity.
- Verify output streams: Ensure you’re writing to stdout/stderr, not just files.
- Check worker status: Logs only appear for successfully initialized workers.
- Retention period: Logs older than 90 days are automatically removed.
Log throttling
To avoid log throttling:
- Reduce log verbosity in production.
- Use structured logging for efficiency.
- Store detailed logs on network volumes instead of console output.
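A minimal sketch of the logging advice above, using only the Python standard library: one compact JSON line per event, written to stdout, with verbosity limited to WARNING and above. The logger name "worker" and the chosen fields are assumptions for illustration.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each record as a single compact JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def make_logger(name: str = "worker", level: int = logging.WARNING) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if not logger.handlers:
        stream = logging.StreamHandler(sys.stdout)
        stream.setFormatter(JsonFormatter())
        logger.addHandler(stream)
    return logger


log = make_logger()
log.debug("per-token details")              # suppressed at WARNING level
log.warning("model load took %.1fs", 12.3)  # emitted as one JSON line
```

Raising the level in production and lowering it only while debugging keeps log volume down without changing any call sites.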
vLLM-specific issues
OOM errors
If your vLLM worker runs out of memory:
- Lower GPU_MEMORY_UTILIZATION from 0.90 to 0.85.
- Reduce MAX_MODEL_LEN to limit context window.
- Use a GPU with more VRAM.
Model not loading
| Issue | Solution |
|---|---|
| Model not found | Verify MODEL_NAME matches the Hugging Face model ID exactly |
| Gated model access denied | Set HF_TOKEN with a token that has access to the model |
| Incompatible model | Check vLLM supported models |
OpenAI API errors
| Error | Cause | Solution |
|---|---|---|
| 401 Unauthorized | Invalid API key | Verify RUNPOD_API_KEY is correct |
| 404 Not Found | Wrong endpoint URL | Use the format https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1 |
| Connection refused | Endpoint not ready | Wait for workers to initialize |
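To rule out the first two rows of the table, it can help to build the base URL and the authorization header in one place and fail early if either is wrong. This sketch assumes the API key is read from the RUNPOD_API_KEY environment variable and sent as a Bearer token, as OpenAI-compatible clients usually do; the helper names are hypothetical.

```python
import os


def openai_base_url(endpoint_id: str) -> str:
    """Return the OpenAI-compatible base URL for an endpoint ID."""
    endpoint_id = endpoint_id.strip().strip("/")
    if not endpoint_id:
        raise ValueError("endpoint_id is empty")
    return f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1"


def auth_headers() -> dict:
    """Build the Authorization header, failing early if the key is missing."""
    api_key = os.environ.get("RUNPOD_API_KEY", "")
    if not api_key:
        raise RuntimeError("RUNPOD_API_KEY is not set")
    return {"Authorization": f"Bearer {api_key}"}
```

Most OpenAI-compatible clients accept the returned URL as their base URL setting and the same key as their API key setting.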
Load balancing endpoint issues
“No workers available” error
This means workers didn’t initialize in time. Common causes:
- First request: Workers need time to start. Retry the request. (See Handling cold starts for more information.)
- All workers busy: Increase max_workers to handle more concurrent requests.
- Workers crashing: Check logs for initialization errors.
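Retrying the first request can be sketched as a small helper with exponential backoff. The NoWorkersAvailable exception is hypothetical: it stands for whatever your client code raises when it sees this error. The send argument is any zero-argument callable that performs the request, and the attempt count and delays are illustrative defaults rather than recommended values.

```python
import time


class NoWorkersAvailable(Exception):
    """Hypothetical error raised by the caller's request function."""


def call_with_retry(send, attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call send() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return send()
        except NoWorkersAvailable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Passing sleep as a parameter keeps the helper easy to test without waiting in real time.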
Requests not reaching workers
Verify your HTTP server is:
- Listening on port 8000 (or the port specified in your configuration).
- Binding to 0.0.0.0, not 127.0.0.1.
- Returning proper HTTP responses.
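The three checks above can be illustrated with a minimal standard-library server. This is a sketch rather than a recommended production setup: the handler answers every GET path with the same JSON body, and PingHandler and make_server are hypothetical names.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class PingHandler(BaseHTTPRequestHandler):
    """Answer every GET with a complete, well-formed HTTP response."""

    def do_GET(self):
        body = b'{"status": "ok"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host: str = "0.0.0.0", port: int = 8000) -> ThreadingHTTPServer:
    """Bind to all interfaces so traffic from outside the container can connect."""
    return ThreadingHTTPServer((host, port), PingHandler)


# To run the server: make_server().serve_forever()
```

Binding to 127.0.0.1 would accept connections only from inside the container itself, which is why requests routed from outside never arrive.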
Getting help
If you’re still experiencing issues:
- Check endpoint logs for detailed error messages.
- SSH into workers using SSH access to debug in real-time.
- Review metrics in the Metrics tab to identify patterns.
- Contact support at help@runpod.io with your endpoint ID and error details.