
Multi-task perception algorithm of autonomous driving based on temporal fusion

LIU Zhan-wen, FAN Song-hua, QI Ming-yuan, DONG Ming, WANG Pin, ZHAO Xiang-mo

Citation: LIU Zhan-wen, FAN Song-hua, QI Ming-yuan, DONG Ming, WANG Pin, ZHAO Xiang-mo. Multi-task perception algorithm of autonomous driving based on temporal fusion[J]. Journal of Traffic and Transportation Engineering, 2021, 21(4): 223-234. doi: 10.19818/j.cnki.1671-1637.2021.04.017

doi: 10.19818/j.cnki.1671-1637.2021.04.017

Funds: 

    National Natural Science Foundation of China U1864204

    National Key Research and Development Program of China 2019YFB1600103

    Key Research and Development Program of Shaanxi Province 2018ZDXM-GY-044

More Information
    Author Bio:

    LIU Zhan-wen (1983-), female, from Qingdao, Shandong, PhD in engineering, associate professor at Chang'an University, engaged in research on deep learning and computer vision, zwliu@chd.edu.cn

    Corresponding author: ZHAO Xiang-mo (1966-), male, from Dazu, Chongqing, PhD in engineering, professor at Chang'an University, xmzhao@chd.edu.cn

  • CLC number: U491.2
  • Abstract: Consecutive image frames were taken as the input, and the temporal correlation information between them was mined to build a multi-task joint visual perception algorithm for the driving environment that fuses temporal information; through joint optimization under multi-task supervision, the algorithm rapidly detects traffic participants while also obtaining drivable-area information. ResNet50 was adopted as the backbone network, and a cascade feature fusion module was built inside it to capture non-local long-range dependencies between different image frames; high-resolution images were downsampled by convolution to accelerate feature extraction for each frame and to balance the accuracy and speed of the algorithm. To eliminate the effect on feature fusion of the spatial displacement caused by object motion between frames, while still exploiting the non-local correlation information of different frames, a temporal feature fusion module was built to temporally align and match the feature maps of the individual frames, forming fused global features. On top of the parameter-shared backbone, the positions of pedestrians, vehicles, and traffic lights on the road were detected by generating keypoint heatmaps, and a semantic segmentation subnetwork provided the autonomous vehicle with drivable-area information. Research results show that the proposed perception algorithm takes multi-frame images instead of a single frame as input and exploits their sequential nature; the cascade feature fusion module reduces the computational complexity to 1/16 of the original through downsampling; compared with other mainstream models such as CornerNet and ICNet, the detection precision of the algorithm improves by 6% on average and its segmentation performance by 5% on average, while a processing speed of 12 frames per second is maintained, giving it a clear advantage in both the speed and the accuracy of detection and segmentation.
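
    To make the mechanism described above concrete, below is a minimal PyTorch sketch of the two ideas the abstract names: non-local temporal fusion of two frames' feature maps, and joint detection/segmentation heads on a parameter-shared ResNet50 trunk. Every name and design choice in it (TemporalFusion, MultiTaskHeads, fusing after the last backbone stage, five detection classes as in Table 5, 19 Cityscapes segmentation classes) is an illustrative assumption, not the paper's released implementation.

```python
# A minimal sketch (assumption, not the authors' code) of non-local temporal
# fusion between consecutive frames plus multi-task heads on a shared backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class TemporalFusion(nn.Module):
    """Fuses a past frame's feature map into the current one via dot-product
    attention: motion between frames is handled by feature matching rather
    than explicit pixel alignment."""

    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.query = nn.Conv2d(channels, reduced, 1)   # from current frame
        self.key = nn.Conv2d(channels, reduced, 1)     # from past frame
        self.value = nn.Conv2d(channels, channels, 1)  # from past frame
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, cur, past):
        b, c, h, w = cur.shape
        q = self.query(cur).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.key(past).flatten(2)                    # (B, C', HW)
        v = self.value(past).flatten(2).transpose(1, 2)  # (B, HW, C)
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)  # (B, HW, HW)
        fused = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return cur + self.out(fused)  # residual fusion of temporal context


class MultiTaskHeads(nn.Module):
    """A CenterNet-style keypoint-heatmap head for detection and a semantic
    segmentation head, both reading the same fused feature map."""

    def __init__(self, channels, det_classes, seg_classes):
        super().__init__()
        self.heat = nn.Sequential(
            nn.Conv2d(channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, det_classes, 1))
        self.seg = nn.Sequential(
            nn.Conv2d(channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, seg_classes, 1))

    def forward(self, feat, out_size):
        heat = torch.sigmoid(self.heat(feat))  # per-class object-center heatmaps
        seg = F.interpolate(self.seg(feat), size=out_size,
                            mode="bilinear", align_corners=False)
        return heat, seg


# Usage sketch: one ResNet50 trunk with shared parameters processes both
# frames; the feature maps are fused, then the two task heads run once.
backbone = nn.Sequential(
    *list(torchvision.models.resnet50(weights=None).children())[:-2])
cur = torch.randn(1, 3, 512, 1024)    # current frame
past = torch.randn(1, 3, 512, 1024)   # previous frame
f_cur, f_past = backbone(cur), backbone(past)      # each (1, 2048, 16, 32)
fused = TemporalFusion(2048)(f_cur, f_past)
heat, seg = MultiTaskHeads(2048, det_classes=5, seg_classes=19)(
    fused, out_size=(512, 1024))
```

    Fusing at a downsampled stage is what keeps such attention affordable: the HW×HW attention matrix grows with the square of the number of spatial positions, so halving each spatial dimension of the feature map shrinks it by a factor of 16, consistent with the 1/16 complexity reduction the abstract attributes to the downsampling in the cascade feature fusion module.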

     

  • Figure 1.  Overall structure of algorithm

    Figure 2.  Pyramid pooling module

    Figure 3.  Multi-frame image temporal fusion module

    Figure 4.  Pyramid semantic pooling module

    Figure 5.  Partial street image examples

    Figure 6.  Images after mosaic data synthesis

    Figure 7.  Multi-frame image fusion algorithm

    Figure 8.  P-r curves of different object detection methods

    Figure 9.  Partial qualitative results of algorithm

    Table 1.  Experimental results of object detection subtask

    Algorithm        Backbone network   Speed/(frame·s⁻¹)  Average precision/%  Recall/%
    YOLOv3[11]       Darknet53          24.0               79.2                 84.9
    Mask R-CNN[29]   ResNeXt-101        11.4               84.2                 86.2
    CornerNet[12]    Hourglass-104      4.6                86.9                 86.5
    CenterNet[13]    ResNet101          6.8                87.6                 87.2
    TridentNet[30]   ResNeXt-101-DCN    0.7                91.0                 88.3
    MadNet (ours)    ResNet50           12.6               89.8                 87.8
    MadNet (ours)    ResNet101-DCN      5.9                91.8                 90.1

    Table 2.  Experimental results of semantic segmentation subtask

    Algorithm                Input image size/pixel  Speed/(frame·s⁻¹)  Mean IoU/%
    ICNet[6]                 1 024×2 048             30.3               67.7
    DFF[31]                  512×1 024               5.6                69.2
    LERNet[23]               512×1 024               100.0              69.5
    TapLab[22]               1 024×2 048             99.8               70.6
    LVS[21]                  713×713                 5.8                76.8
    PSPNet101[5]             713×713                 2.8                79.7
    MadNet (ResNet50)        1 024×2 048             12.1               78.8
    MadNet (ResNet101-DCN)   1 024×2 048             5.4                80.5

    Table 3.  Multi-task joint experiment results

    Backbone network  Speed/(frame·s⁻¹)  Mean IoU/%  Average precision/%  Recall/%
    ResNet50          11.5               79.3        90.2                 88.4
    ResNet101-DCN     5.1                81.6        92.4                 90.5

    Table 4.  Experimental results of cascade feature fusion module at different insertion positions

    Insertion position  Speed/(frame·s⁻¹)  Mean IoU/%  Average precision/%  Recall/%
    After Stage 1       14.0               74.8        85.7                 83.4
    After Stage 2       12.8               77.3        88.4                 85.7
    After Stage 3       11.5               79.3        90.2                 88.4
    After Stage 4       8.4                79.6        90.5                 87.6

    Table 5.  Detection precisions of specific objects

    Average precision/%  Detection precision of specific objects/%
                         Bicycle  Truck  Pedestrian  Car   Traffic light
    92.4                 88.5     94.4   96.2        97.7  90.2

    Table 6.  Influence of different numbers of input image frames on algorithm performance

    Number of input image frames  Speed/(frame·s⁻¹)  Mean IoU/%  Average precision/%  Recall/%
    2                             11.5               79.3        90.2                 88.4
    3                             8.6                79.7        91.2                 88.4
    4                             4.4                79.8        91.2                 88.6
  • [1] CHEN L C, PAPANDREOU G, KOKKINOS I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[C]//ICLR. 3rd International Conference on Learning Representations. San Diego: ICLR, 2015: 357-361.
    [2] CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848. doi: 10.1109/TPAMI.2017.2699184
    [3] CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation[EB/OL]. https://arxiv.org/abs/1706.05587, 2017-08-08/2017-12-05.
    [4] CHEN L C, ZHU Yu-kun, PAPANDREOU G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[C]//Springer. 15th European Conference on Computer Vision. Berlin: Springer, 2018: 833-851.
    [5] ZHAO Heng-shuang, SHI Jian-ping, QI Xiao-juan, et al. Pyramid scene parsing network[C]//IEEE. 30th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 6230-6239.
    [6] ZHAO Heng-shuang, QI Xiao-juan, SHEN Xiao-yong, et al. ICNet for real-time semantic segmentation on high-resolution images[C]//Springer. 15th European Conference on Computer Vision. Berlin: Springer, 2018: 418-434.
    [7] LIU Zhan-wen, QI Ming-yuan, SHEN Chao, et al. Cascade saccade machine learning network with hierarchical classes for traffic sign detection[J]. Sustainable Cities and Society, 2021, 67: 30914-30928. http://www.sciencedirect.com/science/article/pii/S2210670720309148
    [8] REN Shao-qing, HE Kai-ming, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi: 10.1109/TPAMI.2016.2577031
    [9] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection[C]//IEEE. 29th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2016: 779-788.
    [10] REDMON J, FARHADI A. YOLO9000: better, faster, stronger[C]//IEEE. 30th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 6517-6525.
    [11] REDMON J, FARHADI A. YOLOv3: an incremental improvement[EB/OL]. https://arxiv.org/abs/1804.02767, 2018-04-08.
    [12] LAW H, DENG Jia. CornerNet: detecting objects as paired keypoints[J]. International Journal of Computer Vision, 2020, 128(3): 642-656. doi: 10.1007/s11263-019-01204-1
    [13] ZHOU Xing-yi, WANG De-quan, KRÄHENBÜHL P. Objects as points[EB/OL]. https://arxiv.org/abs/1904.07850v1, 2019-04-16/2019-04-25.
    [14] ZHAO Yi, QI Ming-yuan, LI Xiao-hui, et al. P-LPN: towards real time pedestrian location perception in complex driving scenes[J]. IEEE Access, 2020, 8: 54730-54740. doi: 10.1109/ACCESS.2020.2981821
    [15] TEICHMANN M, WEBER M, ZÖLLNER M, et al. MultiNet: Real-time joint semantic reasoning for autonomous driving[C]//IEEE. 2018 IEEE Intelligent Vehicles Symposium. New York: IEEE, 2018: 1013-1020.
    [16] SISTU G, LEANG I, YOGAMANI S. Real-time joint object detection and semantic segmentation network for automated driving[EB/OL]. https://arxiv.org/abs/1901.03912, 2019-06-12.
    [17] CHEN Zhao, BADRINARAYANAN V, LEE C Y, et al. GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks[C]//ICML. 35th International Conference on Machine Learning. Stockholm: ICML, 2018: 794-803.
    [18] KENDALL A, GAL Y, CIPOLLA R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics[C]//IEEE. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 7482-7491.
    [19] SENER O, KOLTUN V. Multi-task learning as multi-objective optimization[C]//NeurIPS. 32nd Conference on Neural Information Processing Systems. Montréal: NeurIPS, 2018: 525-526.
    [20] ZHAO Xiang-mo, QI Ming-yuan, LIU Zhan-wen, et al. End-to-end autonomous driving decision model joined by attention mechanism and spatiotemporal features[J]. IET Intelligent Transport Systems, 2021, 8: 1119-1130. http://www.researchgate.net/publication/352733796_End-to-end_autonomous_driving_decision_model_joined_by_attention_mechanism_and_spatiotemporal_features
    [21] LI Yu-le, SHI Jian-ping, LIN Da-hua. Low-latency video semantic segmentation[C]//IEEE. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 5997-6005.
    [22] FENG Jun-yi, LI Song-yuan, LI Xi, et al. TapLab: a fast framework for semantic video segmentation tapping into compressed-domain knowledge[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, https://ieeexplore.ieee.org/document/9207876.
    [23] WU Jun-rong, WEN Zong-zheng, ZHAO San-yuan, et al. Video semantic segmentation via feature propagation with holistic attention[J]. Pattern Recognition, 2020, 104: 107268. doi: 10.1016/j.patcog.2020.107268
    [24] HE Kai-ming, ZHANG Xiang-yu, REN Shao-qing, et al. Identity mappings in deep residual networks[C]//Springer. 14th European Conference on Computer Vision. Berlin: Springer, 2016: 630-645.
    [25] HU Ping, HEILBRON F C, WANG O, et al. Temporally distributed networks for fast video semantic segmentation[C]//IEEE. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2020: 8815-8824.
    [26] ZHU Zhen, XU Meng-du, BAI Song, et al. Asymmetric non-local neural networks for semantic segmentation[C]//IEEE. 2019 International Conference on Computer Vision. New York: IEEE, 2019: 593-602.
    [27] CORDTS M, OMRAN M, RAMOS S, et al. The cityscapes dataset for semantic urban scene understanding[C]//IEEE. 29th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2016: 3213-3223.
    [28] YUN S D, HAN D Y, OH S J, et al. CutMix: regularization strategy to train strong classifiers with localizable features[C]//IEEE. 2019 International Conference on Computer Vision. New York: IEEE, 2019: 6022-6031.
    [29] HE Kai-ming, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 386-397. doi: 10.1109/TPAMI.2018.2844175
    [30] LI Yang-hao, CHEN Yun-tao, WANG Nai-yan, et al. Scale-aware trident networks for object detection[C]//IEEE. 2019 International Conference on Computer Vision. New York: IEEE, 2019: 6053-6062.
    [31] ZHU Xi-zhou, XIONG Yu-wen, DAI Ji-feng, et al. Deep feature flow for video recognition[C]//IEEE. 30th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 4141-4150.
Figures (9) / Tables (6)
Metrics
  • Article views: 1144
  • Full-text HTML views: 274
  • PDF downloads: 131
  • Cited by: 0
Publication history
  • Received: 2021-02-03
  • Published online: 2021-09-16
  • Issue published: 2021-08-01
