Citation: LIU Zhan-wen, FAN Song-hua, QI Ming-yuan, DONG Ming, WANG Pin, ZHAO Xiang-mo. Multi-task perception algorithm of autonomous driving based on temporal fusion[J]. Journal of Traffic and Transportation Engineering, 2021, 21(4): 223-234. doi: 10.19818/j.cnki.1671-1637.2021.04.017
In recent years, autonomous driving has been incorporated into the national development strategies of many developed countries. China has also made a series of policy deployments in intelligent connected vehicles, autonomous driving, smart transportation, and related areas, aiming to break through traditional technological bottlenecks, promote the deep integration of the automotive industry with information technologies such as artificial intelligence and communication, and foster the development of the autonomous driving industry ecosystem. Deep learning has achieved excellent results in both object detection and semantic segmentation. The DeepLab series of networks proposed by Google Brain adopted deep learning methods and, through continuous innovation, achieved significant breakthroughs in pixel-level image segmentation[1-4]. Subsequently, Zhao et al.[5-6] proposed PSPNet, which introduces a pyramid pooling module on top of a residual network to aggregate contextual information from different regions, improving the network's ability to mine global context and greatly enhancing the performance of semantic segmentation models; their real-time segmentation network ICNet greatly shortens the processing time of semantic segmentation with only a minimal loss of accuracy, enabling segmentation models to process video streams in real time and making their application in autonomous driving feasible. Object detection models can accurately locate and detect objects in the road environment[7]. The Faster R-CNN detector proposed by Ren et al.[8] achieves end-to-end training and greatly accelerates inference compared with earlier models; Redmon et al.[9-11] proposed the YOLO series, which takes the entire image as a single instance to quickly extract feature maps and predict bounding boxes together with their categories. In particular, the accuracy and real-time performance of YOLOv3 have made it a commonly used solution for object detection in industrial applications.
However, both the YOLO series and the Faster R-CNN series use anchor boxes to locate targets, which implicitly introduces many hyperparameters and increases the difficulty of network training. The models proposed by Law et al.[12-13] abandon anchor boxes in favor of keypoint detection, introducing new ideas for object detection and shortening network inference time, which makes object detection models better suited to real-time tasks such as autonomous driving.
With the continuous development of deep learning, attention is no longer limited to single tasks; instead, deep learning models are expected to handle multiple tasks in parallel. Zhao et al.[14] constructed a real-time road environment perception model for autonomous driving scenarios and achieved pedestrian position detection by integrating a lightweight RPN; Teichmann et al.[15] developed a joint classification, detection, and semantic segmentation method for autonomous driving, but although it is multi-task, splitting detection and classification into two sub-tasks increases the complexity of the network; Sistu et al.[16] proposed a multi-task model for autonomous driving with detection and segmentation sub-tasks, achieving low-power real-time performance by sharing an encoder between them. The models in [14-16] all target autonomous driving tasks, but they do not account for the fact that the input in autonomous driving is a continuous video stream: detecting and segmenting single frames discards useful information between frames. Moreover, these multi-task networks do not analyze the loss proportion of each sub-task, and the resulting loss imbalance during training prevents all tasks from reaching their optimum simultaneously. Chen et al.[17-20] studied this problem in depth and optimized the loss weights through model self-learning, so that no single sub-task dominates training at the expense of the others. To exploit the useful information between frames when the input is a video stream, Li et al.[21] proposed selecting key frames with an adaptive strategy based on DFF, but such optical-flow-based models are only suitable for slowly changing scenes: in rapidly changing scenes the segmentation accuracy drops sharply, and the high cost of computing optical flow slows network inference. Feng et al.[22] designed a fast feature warping module that uses motion vectors for acceleration and suppresses motion-vector noise through residual-guided correction and selection modules; Wu et al.[23] proposed using attention modules to extract and fuse high-level features from non-key frames, balancing inference speed and accuracy. Although the models in [22-23] improve segmentation speed, they all do so at the cost of accuracy.
To address the problem that existing perception algorithms cannot balance detection accuracy and inference speed, this paper adopts a stable residual network as the backbone, considers the latent relationship between image frames, and constructs a temporal-fusion backbone to extract features from continuous video data; a cascaded feature fusion module is added to the backbone to maximize video-stream processing accuracy while meeting real-time requirements; and, considering the coupling between tasks, the algorithm self-learns the weight of each task loss to obtain the optimal weight ratio, constructing a multi-task joint network for semantic segmentation and object detection that achieves accurate perception of the autonomous driving environment.
Autonomous vehicles need to perceive the drivable area and the main traffic participants (pedestrians, vehicles, etc.) in real time while driving. The framework of the proposed multi-task joint driving environment visual perception algorithm with temporal fusion (MadNet) is shown in Figure 1. First, ResNet[24] is used as the backbone for efficient temporal feature fusion; to enlarge the receptive field, the convolutions in Stage 4 and Stage 5 of the ResNet structure are replaced with dilated convolutions with a ratio of 1, and two consecutive frames, frame $t$ and frame $t-1$, are taken as input. Second, a cascaded feature fusion module is used to balance efficiency and accuracy: the shallow feature map $F_2$, which contains more detail information, is fused with the deep feature map $F_1$, which contains more semantic information, balancing algorithm accuracy and inference speed. Then $F_1$ and $F_2$ are fed into the temporal feature fusion module to capture non-local long-range dependencies between frames: the key feature map $K_{t-1}$ of frame $t-1$ is fused with the alignment feature map $Q_t$ of frame $t$, and the semantic feature maps $V_{t-1}$ and $V_t$ are fused, matching and aligning temporal features between frames, so that the fused feature map $F_t$ provides richer semantic information for the subsequent sub-task networks. Finally, the semantic segmentation module in the segmentation sub-network extracts road pixels from the fused feature map, and the anchor-free heatmap in the object detection sub-network detects the center points of traffic participants and generates the corresponding bounding boxes.
For the frame $t-1$ feature map $X_{t-1}\in\mathbb{R}^{C\times H\times W}$ ($C$ is the number of channels, $H$ the height, and $W$ the width of the feature map), two 1×1 convolution kernels with downsampling form two branches: one branch uses pyramid pooling to generate the semantic feature map $V_{t-1}\in\mathbb{R}^{c\times h\times w}$, as shown in Figure 2, and the other branch generates the key feature map $K_{t-1}\in\mathbb{R}^{c\times h\times w}$ ($c=C/8$, $h=H/4$, $w=W/4$, where $c$, $h$, $w$ are the number of channels, height, and width of the downsampled feature map).
For the frame $t$ feature map $X_t\in\mathbb{R}^{C\times H\times W}$, two 1×1 convolution kernels with downsampling generate the semantic feature map $V_t\in\mathbb{R}^{c\times h\times w}$ and the alignment feature map $Q_t\in\mathbb{R}^{c\times h\times w}$. The semantic feature map $V_{t-1}$ provides rich semantic information, while the alignment feature map $Q_t$ is fused with the key feature map $K_{t-1}$ to realize temporal alignment between frame $t$ and frame $t-1$. Their temporal correlation $A_t$ is described as
$$A_t = S\left(Q_t K_{t-1}^{\mathrm{T}}/\eta\right) \tag{1}$$
In the formula: $\eta$ is a scaling parameter, usually taken as $\sqrt{c}$; $S(\cdot)$ is the Softmax activation function.
Let the fused feature map be $F_t$; the fusion process can be described as
$$F_t = C_c\left(V_t,\ A_t V_{t-1}\right) \tag{2}$$
In the formula: $C_c(\cdot)$ denotes the stacking (concatenation) operation along the channel dimension between feature maps, where all feature maps have the same height and width.
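The following is a minimal PyTorch sketch of the two-frame temporal fusion described by Eqs. (1)-(2). The module name `TemporalFusion`, the use of plain strided 1×1 convolutions for the downsampling, and the omission of the pyramid pooling on the $V_{t-1}$ branch are simplifying assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalFusion(nn.Module):
    """Two-frame temporal fusion following Eqs. (1)-(2) (simplified sketch)."""
    def __init__(self, in_channels: int):
        super().__init__()
        c = in_channels // 8                     # reduced channel count c = C/8
        # 1x1 convolutions with stride 4 emulate the downsampling to h = H/4, w = W/4
        self.key   = nn.Conv2d(in_channels, c, kernel_size=1, stride=4)  # K_{t-1}
        self.val_p = nn.Conv2d(in_channels, c, kernel_size=1, stride=4)  # V_{t-1} (pyramid pooling omitted)
        self.query = nn.Conv2d(in_channels, c, kernel_size=1, stride=4)  # Q_t
        self.val_c = nn.Conv2d(in_channels, c, kernel_size=1, stride=4)  # V_t
        self.eta = c ** 0.5                      # scaling parameter eta = sqrt(c)

    def forward(self, x_t: torch.Tensor, x_prev: torch.Tensor) -> torch.Tensor:
        b = x_t.size(0)
        k  = self.key(x_prev).flatten(2)         # (b, c, h*w)
        vp = self.val_p(x_prev).flatten(2)       # (b, c, h*w)
        q  = self.query(x_t).flatten(2)          # (b, c, h*w)
        vt = self.val_c(x_t)                     # (b, c, h, w)
        h, w = vt.shape[2:]
        # Eq. (1): A_t = softmax(Q_t K_{t-1}^T / eta), attention over spatial positions
        a_t = F.softmax(torch.bmm(q.transpose(1, 2), k) / self.eta, dim=-1)  # (b, h*w, h*w)
        # Align the previous-frame semantics with A_t, then Eq. (2): channel concatenation
        v_warp = torch.bmm(vp, a_t.transpose(1, 2)).view(b, -1, h, w)
        return torch.cat([vt, v_warp], dim=1)    # fused feature map F_t
```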
When more consecutive frames are input, the fusion can be extended as

$$F_t = C_c\left[\left(\prod_{i=0}^{n+1} A_{t-i}\right)F_{t-n-1},\ \left(\prod_{i=0}^{n} A_{t-i}\right)V_{t-n},\ \ldots,\ A_t V_{t-1},\ V_t\right] \tag{3}$$

$$A_{t-i} = S\left(Q_{t-i} K_{t-i-1}^{\mathrm{T}}/\eta\right) \tag{4}$$
As the number of input image frames increases, the computational complexity of the algorithm increases accordingly. The impact of the number of input frames on algorithm performance is analyzed in detail in the experiments.
During forward propagation, a cascaded feature fusion module is used to accelerate the algorithm, as shown in Figure 1(c). Specifically, after Stage 3 of the ResNet backbone the input splits into two branches: one branch downsamples the feature map by a factor of two and propagates it to the deeper Stage 4, while the other branch outputs the feature map $F_2$. The feature map $F_1$ produced by the backbone and $F_2$ are then fed into the cascaded feature fusion module together. First, $F_1$ is upsampled by a factor of two and processed with a 3×3 dilated convolution to preserve the receptive field; second, $F_2$ is processed with a 1×1 convolution and merged with the dilated-convolution output of $F_1$. Although $F_1$ loses some detail in the downsampling before Stage 4, it propagates semantic information faster; it is fused with $F_2$, which retains more image detail, in the cascaded feature fusion module to obtain the stacked feature map $X_n$, balancing the accuracy and speed of the algorithm. The process can be described as
$$X_n = R\left\{C_c\left[\gamma_1(F_1),\ \gamma_2(F_2)\right]\right\} \tag{5}$$
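A minimal sketch of the cascaded feature fusion of Eq. (5) is given below, assuming that $\gamma_1$ and $\gamma_2$ are the 3×3 dilated and 1×1 convolutions described above and that $R$ is a ReLU activation (the paper does not spell out $R$ here); the channel counts and dilation rate are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeFeatureFusion(nn.Module):
    """Cascaded feature fusion of Eq. (5): X_n = R{C_c[gamma_1(F_1), gamma_2(F_2)]} (sketch)."""
    def __init__(self, c_deep: int, c_shallow: int, c_out: int, dilation: int = 2):
        super().__init__()
        # gamma_1: 3x3 dilated convolution applied to the upsampled deep feature map F_1
        self.gamma1 = nn.Conv2d(c_deep, c_out, kernel_size=3,
                                padding=dilation, dilation=dilation)
        # gamma_2: 1x1 convolution applied to the shallow, detail-rich feature map F_2
        self.gamma2 = nn.Conv2d(c_shallow, c_out, kernel_size=1)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # 2x upsampling of F_1 so that its spatial size matches F_2
        f1_up = F.interpolate(f1, scale_factor=2, mode="bilinear", align_corners=False)
        # Channel-wise stacking C_c followed by R (assumed here to be ReLU)
        x_n = torch.cat([self.gamma1(f1_up), self.gamma2(f2)], dim=1)
        return torch.relu(x_n)
```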
To detect traffic participants in each frame, a $c$-channel heatmap is first used to predict the center points of elements such as pedestrians and vehicles, and a bounding box is then regressed for each object, where each heatmap channel corresponds to one detection category. This anchor-free detection method has fewer hyperparameters, is flexible and lightweight, and can achieve better detection results than algorithms that require anchor boxes. Specifically, when constructing the supervision data, the coordinates of the top-left corner $(x_{\mathrm{lt}}, y_{\mathrm{lt}})$ and the bottom-right corner $(x_{\mathrm{rb}}, y_{\mathrm{rb}})$ of an object's bounding box in the dataset are used to obtain the coordinates $(p_x, p_y)$ of the object's center point $p$. Because the heatmap has a lower resolution than the dataset images, the center-point coordinates on the heatmap are computed as $\bar{p}_x=\lfloor p_x/m\rfloor$, $\bar{p}_y=\lfloor p_y/m\rfloor$. A Gaussian kernel centered at this point assigns a weight $Y$ to each heatmap point $(x, y)$, that is
$$Y = \exp\left\{-\left[(x-\bar{p}_x)^2 + (y-\bar{p}_y)^2\right]/\left(2\sigma_p^2\right)\right\} \tag{6}$$
In the formula: $\sigma_p$ is the adaptive standard deviation.
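A short NumPy sketch of how the ground-truth Gaussian weights of Eq. (6) can be written onto a heatmap is shown below; the downsampling stride `stride` and the rule used for the adaptive $\sigma_p$ (here a placeholder tied to the box size) are assumptions for illustration only.

```python
import numpy as np

def draw_center_heatmap(heatmap, box, stride=4):
    """Write the Gaussian weights of Eq. (6) for one object onto its class heatmap (sketch)."""
    x_lt, y_lt, x_rb, y_rb = box                       # top-left / bottom-right box corners
    px, py = (x_lt + x_rb) / 2.0, (y_lt + y_rb) / 2.0  # object center point p
    cx, cy = int(px // stride), int(py // stride)      # center on the low-resolution heatmap
    # Placeholder for the adaptive standard deviation sigma_p (the paper adapts it to object size)
    sigma = max(x_rb - x_lt, y_rb - y_lt) / (6.0 * stride) + 1e-6
    h, w = heatmap.shape
    ys, xs = np.ogrid[0:h, 0:w]
    # Eq. (6): Gaussian weight around the discretized center point
    gauss = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    # Where two Gaussian regions overlap, keep the larger weight at each point
    np.maximum(heatmap, gauss, out=heatmap)
    return heatmap
```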
If two Gaussian regions overlap, the point with the larger weight is kept. The prediction loss $L_h$ of the points on the heatmap is calculated using logistic regression, that is
$$L_h = -\frac{1}{N}\sum_{x,y,c}\begin{cases}\left(1-\hat{Y}\right)^{\alpha}\lg\hat{Y}, & Y=1\\[2pt](1-Y)^{\beta}\hat{Y}^{\alpha}\lg\left(1-\hat{Y}\right), & \text{otherwise}\end{cases} \tag{7}$$
Because the predicted center point is displaced during forward propagation of the image, an additional offset is introduced to correct the discretization error, with the corresponding loss $L_o$ defined as
$$L_o = \frac{1}{N}\sum_{p}\left|\hat{O}_{\bar{p}} - \left(\frac{p}{m}-\bar{p}\right)\right| \tag{8}$$
In the formula: $\hat{O}_{\bar{p}}$ is the predicted offset at the center point $\bar{p}$ on the heatmap; $\left(\frac{p}{m}-\bar{p}\right)$ is the offset between the object center point and the center point on the heatmap.
$$L_s = \left|\hat{s} - s\right| \tag{9}$$
The semantic segmentation loss $L_{\mathrm{seg}}$ is

$$L_{\mathrm{seg}} = -\frac{1}{N}\sum_{x,y,2c}\left[Y\lg\hat{Y} + (1-Y)\lg\left(1-\hat{Y}\right)\right] \tag{10}$$
In the formula: $\hat{Y}$ is the predicted semantic value of a point on the feature map.
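As a concrete example of the loss terms above, the following PyTorch sketch implements the heatmap loss of Eq. (7). The exponents `alpha` and `beta` are assumed default values (the paper's exact values are not restated here), the normalization by the number of center points follows common practice for this loss, and the natural logarithm is used in place of lg, which differs only by a constant factor.

```python
import torch

def heatmap_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced logistic regression loss of Eq. (7) for the center-point heatmap (sketch).

    pred, target: tensors of shape (batch, classes, h, w); target holds the Gaussian weights Y.
    """
    pred = pred.clamp(eps, 1.0 - eps)
    pos = target.eq(1.0).float()                 # Y = 1: ground-truth center points
    neg = 1.0 - pos                              # all other heatmap points
    pos_term = ((1.0 - pred) ** alpha) * torch.log(pred) * pos
    neg_term = ((1.0 - target) ** beta) * (pred ** alpha) * torch.log(1.0 - pred) * neg
    num_pos = pos.sum().clamp(min=1.0)           # N: number of annotated center points (assumed)
    return -(pos_term.sum() + neg_term.sum()) / num_pos
```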
The constructed multi-task algorithm performs semantic segmentation and object detection simultaneously. Parameters are shared in the backbone network, so no additional parameters are introduced, keeping the network lightweight while enabling feature sharing. Through end-to-end supervised learning and joint multi-task optimization of the network parameters, parameter sharing and information complementarity enhance the overall performance of the network and improve its running speed.
The algorithm in this paper is trained on the Cityscapes_Sequence dataset[27]. Cityscapes_Sequence contains 5000 video clips (150000 frames in total) together with the semantic labels corresponding to each annotated frame. Among them, 2975 clips are used for training, 500 for validation, and 1525 for testing. The videos were captured by a high-definition camera mounted at the front of a vehicle and cover street scenes of 18 European cities. Some of the street-view data are shown in Figure 5.
Because Cityscapes_Sequence is a semantic segmentation dataset, it does not provide the bounding-box annotations for each object that an object detection dataset requires. Therefore, the image annotation software Labeling was first used to annotate the images carrying semantic labels in Cityscapes_Sequence: bounding boxes were added for common road elements such as pedestrians, vehicles, and traffic signals, covering the potential targets encountered in autonomous driving.
In the data augmentation stage before training, in addition to common methods such as random rotation, cropping, and translation, mosaic data augmentation[28] is introduced: multiple frames are mixed into a new input image, which greatly enriches the contextual information of the images and enhances the robustness of the algorithm. As shown in Figure 6, the mosaic augmentation mixes 4 training images into a single frame that is fed into the network.
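A minimal sketch of this 4-image mixing is given below, assuming equal-size inputs and a simple 2×2 grid split at the image midpoint; the paper's exact cropping scheme and label handling are not reproduced (bounding boxes and segmentation masks would need the same transform).

```python
import numpy as np

def mosaic_mix(images):
    """Mix 4 equal-size training images of shape (H, W, 3) into one mosaic frame (sketch)."""
    assert len(images) == 4, "mosaic augmentation combines exactly 4 images"
    h, w = images[0].shape[:2]
    cy, cx = h // 2, w // 2                # split point of the 2x2 grid
    out = np.zeros_like(images[0])
    out[:cy, :cx] = images[0][:cy, :cx]    # top-left quadrant
    out[:cy, cx:] = images[1][:cy, cx:]    # top-right quadrant
    out[cy:, :cx] = images[2][cy:, :cx]    # bottom-left quadrant
    out[cy:, cx:] = images[3][cy:, cx:]    # bottom-right quadrant
    return out
```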
The total loss $L$ of the multi-task network weights the object detection loss $L_{\mathrm{obj}}$ and the semantic segmentation loss $L_{\mathrm{seg}}$ with self-learned parameters $\sigma_1$ and $\sigma_2$, that is

$$L = \frac{L_{\mathrm{obj}}}{2\sigma_1^2} + \frac{L_{\mathrm{seg}}}{2\sigma_2^2} + \lg\left(\sigma_1^2\sigma_2^2\right) \tag{11}$$
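A sketch of the self-learned loss weighting of Eq. (11) is shown below. Parameterizing the two task uncertainties as log-variances is an implementation convenience assumed here for numerical stability, and the natural logarithm again replaces lg up to a constant factor.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Multi-task loss of Eq. (11): L = L_obj/(2*s1^2) + L_seg/(2*s2^2) + log(s1^2 * s2^2) (sketch)."""
    def __init__(self):
        super().__init__()
        # Learn log(sigma^2) for each task so that the weights stay positive during training
        self.log_var_obj = nn.Parameter(torch.zeros(1))
        self.log_var_seg = nn.Parameter(torch.zeros(1))

    def forward(self, loss_obj: torch.Tensor, loss_seg: torch.Tensor) -> torch.Tensor:
        term_obj = 0.5 * torch.exp(-self.log_var_obj) * loss_obj   # L_obj / (2*sigma_1^2)
        term_seg = 0.5 * torch.exp(-self.log_var_seg) * loss_seg   # L_seg / (2*sigma_2^2)
        # Regularizer log(sigma_1^2 * sigma_2^2) = log_var_obj + log_var_seg
        return (term_obj + term_seg + self.log_var_obj + self.log_var_seg).squeeze()
```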
The precision $P$ and recall $r$ are calculated as

$$\begin{cases}P = \dfrac{T}{T+F}\\[8pt] r = \dfrac{T}{T+F'}\end{cases} \tag{12}$$
In the formula: $T$ is the number of positive samples predicted as positive; $F$ is the number of negative samples predicted as positive; $F'$ is the number of positive samples predicted as negative.
The intersection over union $\varepsilon$ of the regions $G_1$ and $G_2$ is calculated as

$$\varepsilon = \frac{G_1 \cap G_2}{G_1 \cup G_2} \tag{13}$$
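A small sketch of the evaluation metrics of Eqs. (12)-(13) is given below; computing the region IoU from binary masks is one reasonable reading of $G_1$ and $G_2$, not necessarily the authors' exact evaluation code.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Eq. (12): precision P = T/(T+F), recall r = T/(T+F')."""
    p = tp / (tp + fp) if tp + fp > 0 else 0.0
    r = tp / (tp + fn) if tp + fn > 0 else 0.0
    return p, r

def region_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Eq. (13): intersection over union of two binary regions G_1 and G_2."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / union if union > 0 else 0.0
```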
| Algorithm | Backbone | Speed/(frame·s⁻¹) | Average precision/% | Recall/% |
| --- | --- | --- | --- | --- |
| YOLOv3[11] | Darknet53 | 24.0 | 79.2 | 84.9 |
| Mask R-CNN[29] | ResNeXt-101 | 11.4 | 84.2 | 86.2 |
| CornerNet[12] | Hourglass-104 | 4.6 | 86.9 | 86.5 |
| CenterNet[13] | ResNet101 | 6.8 | 87.6 | 87.2 |
| TridentNet[30] | ResNeXt-101-DCN | 0.7 | 91.0 | 88.3 |
| MadNet (ours) | ResNet50 | 12.6 | 89.8 | 87.8 |
| MadNet (ours) | ResNet101-DCN | 5.9 | 91.8 | 90.1 |
| Backbone | Speed/(frame·s⁻¹) | Mean IoU/% | Average precision/% | Recall/% |
| --- | --- | --- | --- | --- |
| ResNet50 | 11.5 | 79.3 | 90.2 | 88.4 |
| ResNet101-DCN | 5.1 | 81.6 | 92.4 | 90.5 |
| Insertion position | Speed/(frame·s⁻¹) | Mean IoU/% | Average precision/% | Recall/% |
| --- | --- | --- | --- | --- |
| After Stage 1 | 14.0 | 74.8 | 85.7 | 83.4 |
| After Stage 2 | 12.8 | 77.3 | 88.4 | 85.7 |
| After Stage 3 | 11.5 | 79.3 | 90.2 | 88.4 |
| After Stage 4 | 8.4 | 79.6 | 90.5 | 87.6 |
| Average precision/% | Bicycle/% | Truck/% | Pedestrian/% | Car/% | Traffic light/% |
| --- | --- | --- | --- | --- | --- |
| 92.4 | 88.5 | 94.4 | 96.2 | 97.7 | 90.2 |

Note: the per-class columns give the detection precision of the specific objects.
| Number of input image frames | Speed/(frame·s⁻¹) | Mean IoU/% | Average precision/% | Recall/% |
| --- | --- | --- | --- | --- |
| 2 | 11.5 | 79.3 | 90.2 | 88.4 |
| 3 | 8.6 | 79.7 | 91.2 | 88.4 |
| 4 | 4.4 | 79.8 | 91.2 | 88.6 |
[1] CHEN L C, PAPANDREOU G, KOKKINOS I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[C]//ICLR. 3rd International Conference on Learning Representations. San Diego: ICLR, 2015: 357-361.
[2] CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848. doi: 10.1109/TPAMI.2017.2699184
[3] CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation[EB/OL]. https://arxiv.org/abs/1706.05587, 2017-08-08/2017-12-05.
[4] CHEN L C, ZHU Yu-kun, PAPANDREOU G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[C]//Springer. 15th European Conference on Computer Vision. Berlin: Springer, 2018: 833-851.
[5] ZHAO Heng-shuang, SHI Jian-ping, QI Xiao-juan, et al. Pyramid scene parsing network[C]//IEEE. 30th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 6230-6239.
[6] ZHAO Heng-shuang, QI Xiao-juan, SHEN Xiao-yong, et al. ICNet for real-time semantic segmentation on high-resolution images[C]//Springer. 15th European Conference on Computer Vision. Berlin: Springer, 2018: 418-434.
[7] LIU Zhan-wen, QI Ming-yuan, SHEN Chao, et al. Cascade saccade machine learning network with hierarchical classes for traffic sign detection[J]. Sustainable Cities and Society, 2021, 67: 30914-30928.
[8] REN Shao-qing, HE Kai-ming, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi: 10.1109/TPAMI.2016.2577031
[9] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection[C]//IEEE. 29th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2016: 779-788.
[10] REDMON J, FARHADI A. YOLO9000: better, faster, stronger[C]//IEEE. 30th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 6517-6525.
[11] REDMON J, FARHADI A. YOLOv3: an incremental improvement[EB/OL]. https://arxiv.org/abs/1804.02767, 2018-04-08.
[12] LAW H, DENG Jia. CornerNet: detecting objects as paired keypoints[J]. International Journal of Computer Vision, 2020, 128(3): 642-656. doi: 10.1007/s11263-019-01204-1
[13] ZHOU Xing-yi, WANG De-quan, KRÄHENBÜHL P. Objects as points[EB/OL]. https://arxiv.org/abs/1904.07850v1, 2019-04-16/2019-04-25.
[14] ZHAO Yi, QI Ming-yuan, LI Xiao-hui, et al. P-LPN: towards real time pedestrian location perception in complex driving scenes[J]. IEEE Access, 2020, 8: 54730-54740. doi: 10.1109/ACCESS.2020.2981821
[15] TEICHMANN M, WEBER M, ZÖLLNER M, et al. MultiNet: real-time joint semantic reasoning for autonomous driving[C]//IEEE. 2018 IEEE Intelligent Vehicles Symposium. New York: IEEE, 2018: 1013-1020.
[16] SISTU G, LEANG I, YOGAMANI S. Real-time joint object detection and semantic segmentation network for automated driving[EB/OL]. https://arxiv.org/abs/1901.03912, 2019-06-12.
[17] CHEN Zhao, BADRINARAYANAN V, LEE C Y, et al. GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks[C]//ICML. 35th International Conference on Machine Learning. Stockholm: ICML, 2018: 794-803.
[18] KENDALL A, GAL Y, CIPOLLA R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics[C]//IEEE. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 7482-7491.
[19] SENER O, KOLTUN V. Multi-task learning as multi-objective optimization[C]//IFIP. 32nd International Conference on Neural Information Processing Systems. Rome: IFIP, 2017: 525-526.
[20] ZHAO Xiang-mo, QI Ming-yuan, LIU Zhan-wen, et al. End-to-end autonomous driving decision model joined by attention mechanism and spatiotemporal features[J]. IET Intelligent Transport Systems, 2021, 8: 1119-1130.
[21] LI Yu-le, SHI Jian-ping, LIN Da-hua. Low-latency video semantic segmentation[C]//IEEE. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 5997-6005.
[22] FENG Jun-yi, LI Song-yuan, LI Xi, et al. TapLab: a fast framework for semantic video segmentation tapping into compressed-domain knowledge[J]. IEEE Transactions on Software Engineering, 2020, https://ieeexplore.ieee.org/document/9207876.
[23] WU Jun-rong, WEN Zong-zheng, ZHAO San-yuan, et al. Video semantic segmentation via feature propagation with holistic attention[J]. Pattern Recognition, 2020, 104. DOI: 10.1016/j.patcog.2020.107268
[24] HE Kai-ming, ZHANG Xiang-yu, REN Shao-qing, et al. Identity mappings in deep residual networks[C]//Springer. 14th European Conference on Computer Vision. Berlin: Springer, 2016: 630-645.
[25] HU Ping, HEILBRON F C, WANG O, et al. Temporally distributed networks for fast video semantic segmentation[C]//IEEE. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2020: 8815-8824.
[26] ZHU Zhen, XU Meng-du, BAI Song, et al. Asymmetric non-local neural networks for semantic segmentation[C]//IEEE. 2019 International Conference on Computer Vision. New York: IEEE, 2019: 593-602.
[27] CORDTS M, OMRAN M, RAMOS S, et al. The Cityscapes dataset for semantic urban scene understanding[C]//IEEE. 29th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2016: 3213-3223.
[28] YUN S D, HAN D Y, OH S J, et al. CutMix: regularization strategy to train strong classifiers with localizable features[C]//IEEE. 2019 International Conference on Computer Vision. New York: IEEE, 2019: 6022-6031.
[29] HE Kai-ming, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 386-397. doi: 10.1109/TPAMI.2018.2844175
[30] LI Yang-hao, CHEN Yun-tao, WANG Nai-yan, et al. Scale-aware trident networks for object detection[C]//IEEE. 2019 International Conference on Computer Vision. New York: IEEE, 2019: 6053-6062.
[31] ZHU Xi-zhou, XIONG Yu-wen, DAI Ji-feng, et al. Deep feature flow for video recognition[C]//IEEE. 30th IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 4141-4150.