
RLOR: A Flexible Framework of Deep Reinforcement Learning for Operations Research

1️⃣ First work to incorporate an end-to-end vehicle routing model into a modern RL platform (CleanRL)

⚡ Speeds up training of the Attention Model by 8x (25 hours → 3 hours)

🔎 A flexible framework for developing models, algorithms, environments, and search for operations research

News

Demo

We provide inference demos as Colab notebooks (see the decoding sketch after the table):

| Environment | Search | Demo |
| --- | --- | --- |
| TSP | Greedy | Open In Colab |
| CVRP | Multi-Greedy | Open In Colab |
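
The two search modes differ only in how actions are decoded at inference time. The sketch below is illustrative and not code from this repo: policy and env are hypothetical stand-ins for the wrapped attention model and the Gym environment, the reward is assumed to be the negative tour length, and the environment is assumed to expose a different start node on each reset (the POMO-style multi-start trick).

import numpy as np

def greedy_rollout(policy, env):
    # Decode one tour by always taking the highest-probability action.
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        probs = policy(obs)                 # hypothetical policy call returning action probabilities
        action = np.argmax(probs, axis=-1)  # greedy decoding
        obs, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward

def multi_greedy_rollout(policy, env, num_starts):
    # POMO-style inference: greedy decode from several start nodes, keep the best tour.
    rewards = [greedy_rollout(policy, env) for _ in range(num_starts)]
    return max(rewards)  # reward assumed to be negative tour length, so max = shortest tour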

Installation

Conda

conda env create -n <env name> -f environment.yml
# The environment.yml was generated from
# conda env export --no-builds > environment.yml

It can take a few minutes.

Optional dependency

wandb

Refer to their quick start guide for installation.

File structure

All the major implementations are under the rlor folder.

./rlor
├── envs
│   ├── tsp_data.py # load pre-generated data for evaluation
│   ├── tsp_vector_env.py # define the (vectorized) gym environment
│   ├── cvrp_data.py
│   └── cvrp_vector_env.py
├── models
│   ├── attention_model_wrapper.py # wrap the refactored attention model for CleanRL
│   └── nets # contains the refactored attention model
└── ppo_or.py # implementation of PPO with the attention model for operations research problems

ppo_or.py was modified from cleanrl/ppo.py. To see what's changed, use diff:

# apt install diffutils
diff --color ppo.py ppo_or.py

Training OR model with PPO

TSP

python ppo_or.py --num-steps 51 --env-id tsp-v0 --env-entry-point envs.tsp_vector_env:TSPVectorEnv --problem tsp

CVRP

python ppo_or.py --num-steps 60 --env-id cvrp-v0 --env-entry-point envs.cvrp_vector_env:CVRPVectorEnv --problem cvrp
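
The --env-id and --env-entry-point arguments together tell the script which environment class to register and instantiate. As a rough sketch of the mechanism (assuming standard Gym registration; the exact call inside ppo_or.py may differ):

import gym
from gym.envs.registration import register

register(
    id="tsp-v0",                                     # value passed as --env-id
    entry_point="envs.tsp_vector_env:TSPVectorEnv",  # value passed as --env-entry-point
)

env = gym.make("tsp-v0")  # any extra constructor kwargs (e.g. problem size) are repo-specific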

Enable WandB

python ppo_or.py ... --track

Add the --track argument to enable experiment tracking with WandB.
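
In CleanRL-style scripts, the --track flag usually just gates a wandb.init call at startup. A rough sketch (the project name here is hypothetical, and the exact arguments in ppo_or.py may differ):

import wandb

if args.track:                     # args is the parsed command-line namespace
    wandb.init(
        project="rlor",            # hypothetical project name
        config=vars(args),         # log all hyperparameters
        sync_tensorboard=True,     # mirror TensorBoard scalars to WandB
    )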

Where is the TSP data?

It can be generated with the official repo of the attention-learn-to-route paper. You may modify ./envs/tsp_data.py to update the path to the data accordingly.
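
If you prefer to generate evaluation data yourself, the sketch below assumes the attention-learn-to-route convention: a pickled list of instances, each an array of (x, y) node coordinates sampled uniformly from the unit square. The file name is hypothetical; check tsp_data.py for the exact format and path it expects.

import pickle
import numpy as np

num_instances, num_nodes = 10000, 50
rng = np.random.default_rng(1234)
data = rng.random((num_instances, num_nodes, 2)).tolist()  # coordinates in [0, 1)^2

with open("tsp50_test.pkl", "wb") as f:  # hypothetical file name
    pickle.dump(data, f)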

Acknowledgements

The neural network model is refactored and developed from Attention, Learn to Solve Routing Problems!.

The idea of multiple-trajectory training/inference is from POMO: Policy Optimization with Multiple Optima for Reinforcement Learning.

The RL environments are defined with OpenAI Gym.

The PPO algorithm implementation is based on CleanRL.