VADv2 takes a multi-view image sequence as input in a streaming manner, transforms the sensor data into environmental token embeddings, outputs a probabilistic distribution over actions, and samples one action to control the vehicle. This action distribution is learned from large-scale driving demonstrations. VADv2 is trained on Town03, Town04, Town06, Town07, and Town10, and evaluated on the unseen Town05. It runs stably in a fully end-to-end manner, even without a rule-based wrapper.
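
As a rough illustration of this pipeline, the PyTorch sketch below encodes each camera view into environmental tokens, lets a set of discretized action queries attend to them, and samples one action from the resulting softmax distribution. Everything here is an assumption for illustration only: `VADv2Sketch`, the tiny CNN encoder, `num_actions`, `embed_dim`, and the cross-attention layout are placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class VADv2Sketch(nn.Module):
    """Illustrative sketch: multi-view images -> environmental tokens
    -> distribution over a discretized action vocabulary -> sampled
    action. All module choices and sizes are placeholders."""

    def __init__(self, num_actions=4096, embed_dim=256):
        super().__init__()
        # Placeholder per-view image encoder (the real model uses a
        # full perception stack; a tiny CNN stands in here).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Learnable embeddings for the discretized action vocabulary.
        self.action_embed = nn.Embedding(num_actions, embed_dim)
        # Action queries attend to the environmental tokens.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim, num_heads=8, batch_first=True)
        # Scores each action; softmax gives the action distribution.
        self.score_head = nn.Linear(embed_dim, 1)

    def forward(self, images):
        # images: (batch, views, 3, H, W) for the current frame.
        b, v, c, h, w = images.shape
        feats = self.image_encoder(images.flatten(0, 1))   # (b*v, D, 1, 1)
        env_tokens = feats.flatten(1).view(b, v, -1)       # (b, v, D)
        queries = self.action_embed.weight.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.cross_attn(queries, env_tokens, env_tokens)
        logits = self.score_head(attended).squeeze(-1)     # (b, num_actions)
        probs = logits.softmax(dim=-1)                     # action distribution
        action = torch.multinomial(probs, num_samples=1)   # sampled action
        return probs, action

model = VADv2Sketch()
probs, action = model(torch.randn(1, 6, 3, 224, 224))
print(probs.shape, action.item())
```

In deployment one might take the argmax or a filtered sample instead of raw multinomial sampling; the sketch only shows the distribution-then-sample structure described above.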