RLOR: A Flexible Framework of Deep Reinforcement Learning for Operation Research
1️⃣ First work to incorporate an end-to-end vehicle routing model into a modern RL platform (CleanRL)
⚡ Speeds up training of the Attention Model by 8 times (25 hours → 3 hours)
🔎 A flexible framework for developing models, algorithms, environments, and search strategies for operations research
News
- 13/04/2023: We release a web demo on Hugging Face 🤗!
- 24/03/2023: We release our paper on arXiv!
- 20/03/2023: We release a JupyterLab demo and pretrained checkpoints!
- 10/03/2023: We release our codebase!
Demo
We provide an inference demo as a Colab notebook:
| Environment | Search | Demo |
|---|---|---|
| TSP | Greedy | |
| CVRP | Multi-Greedy | |
Installation
Conda
```
conda env create -n <env name> -f environment.yml
# The environment.yml was generated with:
# conda env export --no-builds > environment.yml
```
It can take a few minutes.
Optional dependency
wandb
Refer to their quick start guide for installation.
File structures
All the major implementations are under the rlor folder.
```
./rlor
├── envs
│   ├── tsp_data.py        # load pre-generated data for evaluation
│   ├── tsp_vector_env.py  # define the (vectorized) gym environment
│   ├── cvrp_data.py
│   └── cvrp_vector_env.py
├── models
│   ├── attention_model_wrapper.py  # wrap the refactored attention model for CleanRL
│   └── nets               # contains the refactored attention model
└── ppo_or.py              # implementation of PPO with the attention model for OR problems
```
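For orientation, below is a minimal, illustrative sketch of what a vectorized routing environment in the spirit of envs/tsp_vector_env.py could look like. The class name, observation keys, and reward shaping here are assumptions for illustration, not the repo's actual API; see the source files for the real interface.

```python
import numpy as np


class ToyTSPVectorEnv:
    """Steps a batch of TSP instances in lockstep: one next-node action per instance."""

    def __init__(self, num_envs=4, num_nodes=20, seed=0):
        self.num_envs = num_envs
        self.num_nodes = num_nodes
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Sample node coordinates uniformly in the unit square for every instance.
        self.nodes = self.rng.random((self.num_envs, self.num_nodes, 2))
        self.visited = np.zeros((self.num_envs, self.num_nodes), dtype=bool)
        self.current = np.zeros(self.num_envs, dtype=int)
        self.visited[np.arange(self.num_envs), self.current] = True
        return {"nodes": self.nodes, "action_mask": ~self.visited}

    def step(self, action):
        # action: one node index per parallel instance.
        idx = np.arange(self.num_envs)
        # Reward is the negative length of the edge just traversed.
        reward = -np.linalg.norm(self.nodes[idx, action] - self.nodes[idx, self.current], axis=-1)
        self.current = np.asarray(action)
        self.visited[idx, action] = True
        done = self.visited.all(axis=-1)
        return {"nodes": self.nodes, "action_mask": ~self.visited}, reward, done, {}
```

The actual environments also support multiple trajectories per instance (POMO-style multi-greedy rollouts), which this toy sketch omits.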
The ppo_or.py was modified from cleanrl/ppo.py. To see what's changed, use diff:

```
# apt install diffutils
diff --color ppo.py ppo_or.py
```
Training OR model with PPO
TSP
```
python ppo_or.py --num-steps 51 --env-id tsp-v0 --env-entry-point envs.tsp_vector_env:TSPVectorEnv --problem tsp
```
CVRP
```
python ppo_or.py --num-steps 60 --env-id cvrp-v0 --env-entry-point envs.cvrp_vector_env:CVRPVectorEnv --problem cvrp
```
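Both commands pass an environment id and an entry point. A plausible reading is that ppo_or.py registers the custom environment with gym under that id before building it, roughly as sketched below; the exact registration logic in ppo_or.py may differ.

```python
import gym
from gym.envs.registration import register

# Map the CLI flags onto a gym registration, then build the environment by id.
# Run from the repo root so that the envs package is importable.
env_id = "cvrp-v0"                                  # from --env-id
entry_point = "envs.cvrp_vector_env:CVRPVectorEnv"  # from --env-entry-point
register(id=env_id, entry_point=entry_point)

env = gym.make(env_id)
```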
Enable WandB
```
python ppo_or.py ... --track
```

Add the `--track` argument to enable tracking with WandB.
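CleanRL-style scripts typically wire `--track` to a wandb.init call along the following lines. Treat this as a sketch of the usual pattern rather than the exact code in ppo_or.py; flag names other than `--track` are assumptions.

```python
import argparse

import wandb

# Hypothetical minimal parser; ppo_or.py defines many more arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--track", action="store_true")
parser.add_argument("--wandb-project-name", type=str, default="rlor")  # placeholder name
args = parser.parse_args()

if args.track:
    wandb.init(
        project=args.wandb_project_name,
        config=vars(args),      # log all CLI hyperparameters with the run
        sync_tensorboard=True,  # mirror TensorBoard scalars to WandB
        save_code=True,
    )
```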
Where is the TSP data?
It can be generated from the official repo of the attention-learn-to-route paper. You may modify ./envs/tsp_data.py to update the path to the data accordingly.
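For reference, the evaluation files produced by that repo's generate_data.py are pickled lists of node coordinates. A minimal loading sketch, using a placeholder path and assuming the TSP format, looks like this; ./envs/tsp_data.py is the place to adapt if your paths differ.

```python
import pickle

import numpy as np

# Placeholder path: point this at a file produced by generate_data.py in the
# attention-learn-to-route repo.
with open("data/tsp/tsp50_test_seed1234.pkl", "rb") as f:
    instances = pickle.load(f)

coords = np.array(instances)  # expected shape: (num_instances, num_nodes, 2)
print(coords.shape)
```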
Acknowledgements
The neural network model is refactored and developed from Attention, Learn to Solve Routing Problems!.
The idea of multiple-trajectory training/inference is from POMO: Policy Optimization with Multiple Optima for Reinforcement Learning.
The RL environments are defined with OpenAI Gym.
The PPO algorithm implementation is based on CleanRL.