Abstract: Sequential image frames were used as input to mine the temporal information associated among continuous frames, and a multi-task joint driving-environment visual perception algorithm fusing this temporal information was constructed to rapidly detect traffic participants and the drivable area through multi-task supervision and joint optimization. ResNet50 was used as the backbone network, in which a cascaded feature fusion module was built to capture the non-local long-range dependence among different image frames. High-resolution images were downsampled by convolution to accelerate feature extraction for the different frames, balancing the accuracy and speed of the algorithm. To eliminate the influence of object motion between frames (spatial displacement) on feature fusion, and to account for the non-local dependence among the features of different frames, a temporal feature fusion module was constructed to align and match the feature maps of the different frames in time, forming a fused global feature. Based on the parameter-sharing backbone, keypoint heatmap generation was used to detect the positions of pedestrians, vehicles, and traffic lights on the road, and a semantic segmentation sub-network was built to provide drivable-area information for autonomous vehicles.
Analysis results show that the proposed algorithm takes sequential frames as input instead of a single frame, making effective use of the temporal characteristics of the sequence. Through downsampling, the cascaded feature fusion module reduces the computational complexity to 1/16 of that without the module. Compared with other mainstream models such as CornerNet and ICNet, the detection precision and segmentation performance of the algorithm improve by an average of 6% and 5%, respectively, and the processing speed reaches 12 frames per second. The proposed algorithm therefore has clear advantages in the speed and accuracy of detection and segmentation. 6 tabs, 9 figs, 31 refs.
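The temporal feature fusion described above is, at its core, a non-local (attention) operation between frames, with the reference frame's features spatially subsampled so that the pairwise affinity cost drops by the square of the stride (1/16 for a stride of 4, matching the reduction quoted in the abstract). Below is a minimal NumPy sketch of this idea, not the paper's exact module; the function name, the stride value, and the residual fusion at the end are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_non_local_fusion(cur, prev, stride=4):
    """Fuse current-frame and previous-frame features with a non-local
    (attention) operation: every position of the current frame attends to
    all (subsampled) positions of the previous frame, so the fusion is
    robust to spatial displacement caused by object motion.

    cur, prev: (C, H, W) feature maps of two consecutive frames.
    stride:    subsampling factor on the previous frame's key/value
               positions; it cuts the affinity cost by stride**2.
    """
    C, H, W = cur.shape
    q = cur.reshape(C, H * W).T                  # (HW, C) queries
    kv = prev[:, ::stride, ::stride]             # subsampled previous frame
    k = kv.reshape(C, -1)                        # (C, HW/stride^2) keys
    v = k.T                                      # (HW/stride^2, C) values
    attn = softmax(q @ k / np.sqrt(C), axis=-1)  # (HW, HW/stride^2) affinity
    aligned = (attn @ v).T.reshape(C, H, W)      # prev aligned to current
    return cur + aligned                         # residual fusion
```

With H×W query positions and (H/4)×(W/4) key positions, the affinity matrix has (HW)²/16 entries instead of (HW)², which is the source of the 1/16 complexity figure.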
Table 1. Experimental results of object detection subtask
| Algorithm | Backbone | Speed/(frame·s⁻¹) | Average precision/% | Recall/% |
|---|---|---|---|---|
| YOLOv3[11] | Darknet53 | 24.0 | 79.2 | 84.9 |
| Mask R-CNN[29] | ResNeXt-101 | 11.4 | 84.2 | 86.2 |
| CornerNet[12] | Hourglass-104 | 4.6 | 86.9 | 86.5 |
| CenterNet[13] | ResNet101 | 6.8 | 87.6 | 87.2 |
| TridentNet[30] | ResNeXt-101-DCN | 0.7 | 91.0 | 88.3 |
| MadNet (ours) | ResNet50 | 12.6 | 89.8 | 87.8 |
| MadNet (ours) | ResNet101-DCN | 5.9 | 91.8 | 90.1 |
Table 2. Experimental results of semantic segmentation subtask
Table 3. Multi-task joint experiment results
| Backbone | Speed/(frame·s⁻¹) | Mean IoU/% | Average precision/% | Recall/% |
|---|---|---|---|---|
| ResNet50 | 11.5 | 79.3 | 90.2 | 88.4 |
| ResNet101-DCN | 5.1 | 81.6 | 92.4 | 90.5 |
Table 4. Experimental results of cascade feature fusion module at different insertion positions
| Insertion position | Speed/(frame·s⁻¹) | Mean IoU/% | Average precision/% | Recall/% |
|---|---|---|---|---|
| After Stage1 | 14.0 | 74.8 | 85.7 | 83.4 |
| After Stage2 | 12.8 | 77.3 | 88.4 | 85.7 |
| After Stage3 | 11.5 | 79.3 | 90.2 | 88.4 |
| After Stage4 | 8.4 | 79.6 | 90.5 | 87.6 |
Table 5. Detection precisions of specific objects
| Average precision/% | Bicycle/% | Truck/% | Pedestrian/% | Car/% | Traffic light/% |
|---|---|---|---|---|---|
| 92.4 | 88.5 | 94.4 | 96.2 | 97.7 | 90.2 |
Table 6. Influence of different numbers of input image frames on algorithm performance
| Input image frames | Speed/(frame·s⁻¹) | Mean IoU/% | Average precision/% | Recall/% |
|---|---|---|---|---|
| 2 | 11.5 | 79.3 | 90.2 | 88.4 |
| 3 | 8.6 | 79.7 | 91.2 | 88.4 |
| 4 | 4.4 | 79.8 | 91.2 | 88.6 |
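As the abstract notes, the detection head localizes pedestrians, vehicles, and traffic lights as peaks in generated keypoint heatmaps. A minimal sketch of how such peaks can be decoded at inference time is shown below; a max-filter stands in for non-maximum suppression, and the function name, window size, and threshold are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def decode_heatmap_peaks(heatmap, k=10, threshold=0.3, window=3):
    """Decode object centers from a keypoint heatmap: keep only local
    maxima (a window max-filter replaces box NMS), drop low-confidence
    positions, and return the top-k peaks as (row, col, score) tuples.

    heatmap: (H, W) array of per-pixel center confidences in [0, 1].
    """
    H, W = heatmap.shape
    pad = window // 2
    padded = np.pad(heatmap, pad, mode="constant")
    # A peak must be >= every value in its window (i.e. equal the
    # window maximum), checked by comparing against shifted copies.
    peaks = np.ones_like(heatmap, dtype=bool)
    for dy in range(window):
        for dx in range(window):
            peaks &= heatmap >= padded[dy:dy + H, dx:dx + W]
    ys, xs = np.nonzero(peaks & (heatmap >= threshold))
    scores = heatmap[ys, xs]
    order = np.argsort(scores)[::-1][:k]         # highest score first
    return [(int(ys[i]), int(xs[i]), float(scores[i])) for i in order]
```

In a full detector each peak would additionally index per-pixel regression maps for box size and sub-pixel offset; only the peak extraction step is sketched here.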
[1] CHEN L C, PAPANDREOU G, KOKKINOS I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[C]//ICLR. 3rd International Conference on Learning Representations. San Diego: ICLR, 2015: 357-361.
[2] CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848. doi: 10.1109/TPAMI.2017.2699184
[3] CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation[EB/OL]. https://arxiv.org/abs/1706.05587, 2017-08-08/2017-12-05.
[4] CHEN L C, ZHU Yu-kun, PAPANDREOU G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[C]//Springer. 15th European Conference on Computer Vision. Berlin: Springer, 2018: 833-851.
[5] ZHAO Heng-shuang, SHI Jian-ping, QI Xiao-juan, et al. Pyramid scene parsing network[C]//IEEE. 30th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 6230-6239.
[6] ZHAO Heng-shuang, QI Xiao-juan, SHEN Xiao-yong, et al. ICNet for real-time semantic segmentation on high-resolution images[C]//Springer. 15th European Conference on Computer Vision. Berlin: Springer, 2018: 418-434.
[7] LIU Zhan-wen, QI Ming-yuan, SHEN Chao, et al. Cascade saccade machine learning network with hierarchical classes for traffic sign detection[J]. Sustainable Cities and Society, 2021, 67: 30914-30928.
[8] REN Shao-qing, HE Kai-ming, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi: 10.1109/TPAMI.2016.2577031
[9] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection[C]//IEEE. 29th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2016: 779-788.
[10] REDMON J, FARHADI A. YOLO9000: better, faster, stronger[C]//IEEE. 30th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 6517-6525.
[11] REDMON J, FARHADI A. YOLOv3: an incremental improvement[EB/OL]. https://arxiv.org/abs/1804.02767, 2018-04-08.
[12] LAW H, DENG Jia. CornerNet: detecting objects as paired keypoints[J]. International Journal of Computer Vision, 2020, 128(3): 642-656. doi: 10.1007/s11263-019-01204-1
[13] ZHOU Xing-yi, WANG De-quan, KRÄHENBÜHL P. Objects as points[EB/OL]. https://arxiv.org/abs/1904.07850v1, 2019-04-16/2019-04-25.
[14] ZHAO Yi, QI Ming-yuan, LI Xiao-hui, et al. P-LPN: towards real time pedestrian location perception in complex driving scenes[J]. IEEE Access, 2020, 8: 54730-54740. doi: 10.1109/ACCESS.2020.2981821
[15] TEICHMANN M, WEBER M, ZÖLLNER M, et al. MultiNet: real-time joint semantic reasoning for autonomous driving[C]//IEEE. 2018 IEEE Intelligent Vehicles Symposium. New York: IEEE, 2018: 1013-1020.
[16] SISTU G, LEANG I, YOGAMANI S. Real-time joint object detection and semantic segmentation network for automated driving[EB/OL]. https://arxiv.org/abs/1901.03912, 2019-06-12.
[17] CHEN Zhao, BADRINARAYANAN V, LEE C Y, et al. GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks[C]//ICML. 35th International Conference on Machine Learning. Stockholm: ICML, 2018: 794-803.
[18] KENDALL A, GAL Y, CIPOLLA R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics[C]//IEEE. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 7482-7491.
[19] SENER O, KOLTUN V. Multi-task learning as multi-objective optimization[C]//NeurIPS. 32nd International Conference on Neural Information Processing Systems. Montréal: NeurIPS, 2018: 525-536.
[20] ZHAO Xiang-mo, QI Ming-yuan, LIU Zhan-wen, et al. End-to-end autonomous driving decision model joined by attention mechanism and spatiotemporal features[J]. IET Intelligent Transport Systems, 2021, 8: 1119-1130.
[21] LI Yu-le, SHI Jian-ping, LIN Da-hua. Low-latency video semantic segmentation[C]//IEEE. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 5997-6005.
[22] FENG Jun-yi, LI Song-yuan, LI Xi, et al. TapLab: a fast framework for semantic video segmentation tapping into compressed-domain knowledge[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. https://ieeexplore.ieee.org/document/9207876
[23] WU Jun-rong, WEN Zong-zheng, ZHAO San-yuan, et al. Video semantic segmentation via feature propagation with holistic attention[J]. Pattern Recognition, 2020, 104: 107268. doi: 10.1016/j.patcog.2020.107268
[24] HE Kai-ming, ZHANG Xiang-yu, REN Shao-qing, et al. Identity mappings in deep residual networks[C]//Springer. 14th European Conference on Computer Vision. Berlin: Springer, 2016: 630-645.
[25] HU Ping, HEILBRON F C, WANG O, et al. Temporally distributed networks for fast video semantic segmentation[C]//IEEE. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2020: 8815-8824.
[26] ZHU Zhen, XU Meng-du, BAI Song, et al. Asymmetric non-local neural networks for semantic segmentation[C]//IEEE. 2019 International Conference on Computer Vision. New York: IEEE, 2019: 593-602.
[27] CORDTS M, OMRAN M, RAMOS S, et al. The cityscapes dataset for semantic urban scene understanding[C]//IEEE. 29th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2016: 3213-3223.
[28] YUN S D, HAN D Y, OH S J, et al. CutMix: regularization strategy to train strong classifiers with localizable features[C]//IEEE. 2019 International Conference on Computer Vision. New York: IEEE, 2019: 6022-6031.
[29] HE Kai-ming, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 386-397. doi: 10.1109/TPAMI.2018.2844175
[30] LI Yang-hao, CHEN Yun-tao, WANG Nai-yan, et al. Scale-aware trident networks for object detection[C]//IEEE. 2019 International Conference on Computer Vision. New York: IEEE, 2019: 6053-6062.
[31] ZHU Xi-zhou, XIONG Yu-wen, DAI Ji-feng, et al. Deep feature flow for video recognition[C]//IEEE. 30th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 4141-4150.