VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation

Jiyang Gao1*  Chen Sun2*  Hang Zhao1  Yi Shen1  Dragomir Anguelov1  Congcong Li1  Cordelia Schmid2
1 Waymo LLC   2 Google Research
{jiyanggao, hangz, yshen, dragomir, congcongli}@waymo.com, {chensun, cordelias}@google.com

Abstract

Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights). This paper introduces VectorNet, a hierarchical graph neural network that first exploits the spatial locality of individual road components represented by vectors and then models the high-order interactions among all components. In contrast to most recent approaches, which render trajectories of moving agents and road context information as bird's-eye images and encode them with convolutional neural networks (ConvNets), our approach operates on a vector representation. By operating on the vectorized high definition (HD) maps and agent trajectories, we avoid lossy rendering and computationally intensive ConvNet encoding steps. To further boost VectorNet's capability in learning context features, we propose a novel auxiliary task to recover the randomly masked out map entities and agent trajectories based on their context. We evaluate VectorNet on our in-house behavior prediction benchmark and the recently released Argoverse forecasting dataset. Our method achieves on par or better performance than the competitive rendering approach on both benchmarks while saving over 70% of the model parameters with an order of magnitude reduction in FLOPs. It also outperforms the state of the art on the Argoverse dataset.

Figure 1. Illustration of the rasterized rendering (left) and vectorized approach (right) to represent the high-definition map and agent trajectories.

1. Introduction

This paper focuses on behavior prediction in complex multi-agent systems, such as self-driving vehicles. The core interest is to find a unified representation which integrates the agent dynamics, acquired by perception systems such as object detection and tracking, with the scene context, provided as prior knowledge often in the form of High Definition (HD) maps. Our goal is to build a system which learns to predict the intent of vehicles, which are parameterized as trajectories.

Traditional methods for behavior prediction are rule-based, where multiple behavior hypotheses are generated based on constraints from the road maps. More recently, many learning-based approaches have been proposed [5, 6, 10, 15]; they offer the benefit of having probabilistic interpretations of different behavior hypotheses, but require building a representation to encode the map and trajectory information. Interestingly, while the HD maps are highly structured, organized as entities with location (e.g. lanes) and attributes (e.g. a green traffic light), most of these approaches choose to render the HD maps as color-coded attributes (Figure 1, left), which requires manual specification, and to encode the scene context information with ConvNets, which have limited receptive fields. This raises the question: can we learn a meaningful context representation directly from the structured HD maps?

* Equal contribution.
We propose to learn a unified representation for multi-agent dynamics and structured scene context directly from their vectorized form (Figure 1, right). The geographic extent of the road features can be a point, a polygon, or a curve in geographic coordinates. For example, a lane boundary contains multiple control points that build a spline; a crosswalk is a polygon defined by several points; a stop sign is represented by a single point. All these geographic entities can be closely approximated as polylines defined by multiple control points, along with their attributes. Similarly, the dynamics of moving agents can also be approximated by polylines based on their motion trajectories. All these polylines can then be represented as sets of vectors.

Figure 2. An overview of our proposed VectorNet (input vectors, polyline subgraphs, global interaction graph, supervision and prediction). Observed agent trajectories and map features are represented as sequences of vectors, and passed to a local graph network to obtain polyline-level features. Such features are then passed to a fully-connected graph to model the higher-order interactions. We compute two types of losses: predicting future trajectories from the node features corresponding to the moving agents, and predicting the node features when their features are masked out.

We use graph neural networks (GNNs) to incorporate these sets of vectors. We treat each vector as a node in the graph, and set the node features to be the start location and end location of each vector, along with other attributes such as the polyline group id and semantic labels. The context information from HD maps, along with the trajectories of other moving agents, is propagated to the target agent node through the GNN. We can then take the output node feature corresponding to the target agent to decode its future trajectories.

Specifically, to learn competitive representations with GNNs, we observe that it is important to constrain the connectivities of the graph based on the spatial and semantic proximity of the nodes. We therefore propose a hierarchical graph architecture, where the vectors belonging to the same polylines with the same semantic labels are connected and embedded into polyline features, and all polylines are then fully connected with each other to exchange information. We implement the local graphs with multi-layer perceptrons, and the global graphs with self-attention [30]. An overview of our approach is shown in Figure 2.

Finally, motivated by the recent success of self-supervised learning from sequential linguistic [11] and visual data [27], we propose an auxiliary graph completion objective in addition to the behavior prediction objective. More specifically, we randomly mask out the input node features belonging to either scene context or agent trajectories, and ask the model to reconstruct the masked features. The intuition is to encourage the graph networks to better capture the interactions between agent dynamics and scene context. In summary, our contributions are:

• We are the first to demonstrate how to directly incorporate vectorized scene context and agent dynamics information for behavior prediction.
• We propose the hierarchical graph network VectorNet and the node completion auxiliary task.
• We evaluate the proposed method on our in-house behavior prediction dataset and the Argoverse dataset, and show that our method achieves on par or better performance over a competitive rendering baseline with a 70% model size saving and an order of magnitude reduction in FLOPs. Our method also achieves state-of-the-art performance on Argoverse.

2. Related work

Behavior prediction for autonomous driving. Behavior prediction for moving agents has become increasingly important for autonomous driving applications [7, 9, 19], and high-fidelity maps have been widely used to provide context information.
For example, IntentNet [5] proposes to jointly detect vehicles and predict their trajectories from LiDAR points and rendered HD maps. Hong et al. [15] assume that vehicle detections are provided and focus on behavior prediction by encoding entity interactions with ConvNets. Similarly, MultiPath [6] also uses ConvNets as the encoder, but adopts pre-defined trajectory anchors to regress multiple possible future trajectories. PRECOG [23] attempts to capture the future stochasticity with flow-based generative models. Similar to [6, 15, 23], we also assume the agent detections to be provided by an existing perception algorithm. However, unlike these methods, which all use ConvNets to encode rendered road maps, we propose to directly encode vectorized scene context and agent dynamics.

Forecasting multi-agent interactions. Beyond the autonomous driving domain, there is more general interest in predicting the intents of interacting agents, such as pedestrians [2, 13, 24], human activities [28] or sports players [12, 26, 32, 33]. In particular, Social LSTM [2] models the trajectories of individual agents as separate LSTM networks, and aggregates the LSTM hidden states based on the spatial proximity of the agents to model their interactions. Social GAN [13] simplifies the interaction module and proposes an adversarial discriminator to predict diverse futures. Sun et al. [26] combine graph networks [4] with variational RNNs [8] to model diverse interactions. The social interactions can also be inferred from data: Kipf et al. [18] treat such interactions as latent variables, and graph attention networks [16, 31] apply a self-attention mechanism to weight the edges in a pre-defined graph. Our method goes one step further by proposing a unified hierarchical graph network to jointly model the interactions of multiple agents and their interactions with the entities from road maps.

Representation learning for sets of entities. Traditionally, machine perception algorithms have focused on high-dimensional continuous signals, such as images, videos or audio. One exception is 3D perception, where the inputs are usually in the form of unordered point sets, given by depth sensors. For example, Qi et al. propose the PointNet model [20] and PointNet++ [21] to apply permutation invariant operations (e.g. max pooling) on learned point embeddings. Unlike point sets, entities on HD maps and agent trajectories form closed shapes or are directed, and they may also be associated with attribute information. We therefore propose to keep such information by vectorizing the inputs, and encode the attributes as node features in a graph.

Self-supervised context modeling. Recently, many works in the NLP domain have proposed modeling language context in a self-supervised fashion [11, 22]. Their learned representations achieve significant performance improvements when transferred to downstream tasks. Inspired by these methods, we propose an auxiliary loss for graph representations, which learns to predict the missing node features from their neighbors. The goal is to incentivize the model to better capture the interactions among nodes.

3. VectorNet approach

This section introduces our VectorNet approach. We first describe how to vectorize agent trajectories and HD maps. Next we present the hierarchical graph network which aggregates local information from individual polylines and then globally over all trajectories and map features. This graph can then be used for behavior prediction.

3.1. Representing trajectories and maps

Most of the annotations from an HD map are in the form of splines (e.g. lanes), closed shapes (e.g. regions of intersections) and points (e.g. traffic lights), with additional attribute information such as the semantic labels of the annotations and their current states (e.g. color of the traffic light, speed limit of the road). For agents, their trajectories are in the form of directed splines with respect to time. All of these elements can be approximated as sequences of vectors: for map features, we pick a starting point and direction, uniformly sample key points from the splines at the same spatial distance, and sequentially connect the neighboring key points into vectors; for trajectories, we can just sample key points with a fixed temporal interval (0.1 second), starting from t = 0, and connect them into vectors. Given small enough spatial or temporal intervals, the resulting polylines serve as close approximations of the original map and trajectories.

Our vectorization process is a one-to-one mapping between continuous trajectories, map annotations and the vector set, although the latter is unordered. This allows us to form a graph representation on top of the vector sets, which can be encoded by graph neural networks. More specifically, we treat each vector v_i belonging to a polyline P_j as a node in the graph with node features given by

    v_i = [d_i^s, d_i^e, a_i, j]        (1)

where d_i^s and d_i^e are coordinates of the start and end points of the vector (d itself can be represented as (x, y) for 2D coordinates or (x, y, z) for 3D coordinates); a_i corresponds to attribute features, such as object type and timestamps for trajectories, or road feature type and speed limit for lanes; j is the integer id of P_j, indicating v_i ∈ P_j.

To make the input node features invariant to the locations of target agents, we normalize the coordinates of all vectors to be centered around the location of the target agent at its last observed time step. A future work is to share the coordinate centers for all interacting agents, such that their trajectories can be predicted in parallel.
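As an illustration of this vectorization step, the sketch below turns one sequence of sampled keypoints into vector nodes of the form of Eq. (1). It is a minimal NumPy sketch rather than the released implementation; the helper name, the attribute layout and the example sampling intervals are our own assumptions.

```python
import numpy as np

def vectorize_polyline(points, attributes, polyline_id):
    """Turn an ordered sequence of keypoints into vector nodes [d_s, d_e, a_i, j].

    points:      (N, 2) keypoints, assumed already sampled at a fixed spatial
                 (map features) or temporal (trajectories, 0.1 s) interval and
                 normalized around the target agent's last observed position.
    attributes:  (N-1, k) per-vector attribute features a_i (e.g. object type,
                 timestamp, or lane speed limit).
    polyline_id: integer id j shared by all vectors of this polyline.
    """
    starts = points[:-1]                                  # d_s of each vector
    ends = points[1:]                                     # d_e of each vector
    ids = np.full((len(starts), 1), polyline_id, dtype=points.dtype)
    return np.concatenate([starts, ends, attributes, ids], axis=1)

# Example: a lane boundary sampled every 1 m and a 1 s trajectory at 10 Hz.
lane_pts = np.stack([np.arange(0.0, 10.0), np.zeros(10)], axis=1)
lane_attr = np.zeros((9, 1))                              # e.g. a single "lane" type flag
traj_pts = np.stack([np.linspace(-2, 0, 11), np.linspace(-5, 0, 11)], axis=1)
traj_attr = np.arange(10).reshape(-1, 1) * 0.1            # timestamps as attributes

lane_nodes = vectorize_polyline(lane_pts, lane_attr, polyline_id=0)
agent_nodes = vectorize_polyline(traj_pts, traj_attr, polyline_id=1)
print(lane_nodes.shape, agent_nodes.shape)                # (9, 6) (10, 6)
```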
3.2. Constructing the polyline subgraphs

To exploit the spatial and semantic locality of the nodes, we take a hierarchical approach by first constructing subgraphs at the vector level, where all vector nodes belonging to the same polyline are connected with each other. Considering a polyline P with its nodes {v_1, v_2, ..., v_P}, we define a single layer of the subgraph propagation operation as

    v_i^{(l+1)} = \varphi_{rel}\big( g_{enc}(v_i^{(l)}),\ \varphi_{agg}(\{ g_{enc}(v_j^{(l)}) \}) \big)        (2)

where v_i^{(l)} is the node feature for the l-th layer of the subgraph network, and v_i^{(0)} is the input feature v_i. The function g_enc(·) transforms the individual node features, φ_agg(·) aggregates the information from all neighboring nodes, and φ_rel(·) is the relational operator between node v_i and its neighbors.

In practice, g_enc(·) is a multi-layer perceptron (MLP) whose weights are shared over all nodes; specifically, the MLP contains a single fully connected layer followed by layer normalization [3] and then a ReLU non-linearity. φ_agg(·) is the max-pooling operation, and φ_rel(·) is a simple concatenation. An illustration is shown in Figure 3. We stack multiple layers of the subgraph networks, where the weights for g_enc(·) are different. Finally, to obtain polyline-level features, we compute

    p = \varphi_{agg}(\{ v_i^{(L_p)} \})        (3)

where φ_agg(·) is again max-pooling.

Figure 3. The computation flow on the vector nodes of the same polyline: the input node features are passed through a node encoder, aggregated by a permutation invariant aggregator, and concatenated with the per-node encodings to produce the output node features.

Our polyline subgraph network can be seen as a generalization of PointNet [20]: when we set d^s = d^e and let a and l be empty, our network has the same inputs and compute flow as PointNet. However, by embedding the ordering information into vectors, constraining the connectivity of subgraphs based on the polyline groupings, and encoding attributes as node features, our method is particularly suitable for encoding structured map annotations and agent trajectories.
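The following PyTorch sketch mirrors Eqs. (2)-(3): a shared fully-connected node encoder with layer normalization and ReLU, max-pooling aggregation, and concatenation, stacked and then max-pooled into a polyline feature. It is a rough re-implementation under our own assumptions about tensor shapes, not the authors' code.

```python
import torch
import torch.nn as nn

class SubgraphLayer(nn.Module):
    """One polyline subgraph propagation step (Eq. 2): shared node encoder,
    max-pool aggregation over the polyline, then concatenation."""
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.g_enc = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # single FC layer, weights shared over nodes
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
        )

    def forward(self, v):                    # v: (num_vectors, in_dim), one polyline
        h = self.g_enc(v)                    # g_enc(v_i)
        agg = h.max(dim=0, keepdim=True).values          # phi_agg: max pooling
        agg = agg.expand(h.size(0), -1)                   # broadcast to every node
        return torch.cat([h, agg], dim=-1)                # phi_rel: concatenation

class PolylineSubgraph(nn.Module):
    """Stack of subgraph layers followed by max pooling (Eq. 3)."""
    def __init__(self, in_dim, hidden_dim=64, num_layers=3):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(num_layers):
            layers.append(SubgraphLayer(dim, hidden_dim))
            dim = hidden_dim * 2             # output width doubles because of the concat
        self.layers = nn.ModuleList(layers)

    def forward(self, v):
        for layer in self.layers:
            v = layer(v)
        return v.max(dim=0).values           # polyline feature p

poly_feat = PolylineSubgraph(in_dim=6)(torch.randn(9, 6))
print(poly_feat.shape)                       # torch.Size([128])
```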
3.3. Global graph for high-order interactions

We now consider modeling the high-order interactions on the polyline node features {p_1, p_2, ..., p_P} with a global interaction graph:

    \{ p_i^{(l+1)} \} = GNN\big(\{ p_i^{(l)} \}, \mathcal{A}\big)        (4)

where {p_i^{(l)}} is the set of polyline node features, GNN(·) corresponds to a single layer of a graph neural network, and A corresponds to the adjacency matrix for the set of polyline nodes.

The adjacency matrix A can be provided as a heuristic, for example using the spatial distances [2] between the nodes. For simplicity, we assume A to be a fully-connected graph. Our graph network is implemented as a self-attention operation [30]:

    GNN(P) = softmax\big(P_Q P_K^{\top}\big) P_V        (5)

where P is the node feature matrix and P_Q, P_K and P_V are its linear projections.

We then decode the future trajectories from the nodes corresponding to the moving agents:

    v_i^{future} = \varphi_{traj}\big(p_i^{(L_t)}\big)        (6)

where L_t is the total number of GNN layers, and φ_traj(·) is the trajectory decoder. For simplicity, we use an MLP as the decoder function. More advanced decoders, such as the anchor-based approach from MultiPath [6] or variational RNNs [8, 26], can be used to generate diverse trajectories; these decoders are complementary to our input encoder.

We use a single GNN layer in our implementation, so that during inference time, only the node features corresponding to the target agents need to be computed. However, we can also stack multiple layers of GNN(·) to model higher-order interactions when needed.

To encourage our global interaction graph to better capture interactions among different trajectories and map polylines, we introduce an auxiliary graph completion task. During training time, we randomly mask out the features for a subset of polyline nodes, e.g. p_i. We then attempt to recover the masked out feature as

    \hat{p}_i = \varphi_{node}\big(p_i^{(L_t)}\big)        (7)

where φ_node(·) is the node feature decoder implemented as an MLP. These node feature decoders are not used during inference time.

Recall that p_i is a node from a fully-connected, unordered graph. In order to identify an individual polyline node when its corresponding feature is masked out, we compute the minimum values of the start coordinates from all of its belonging vectors to obtain the identifier embedding p_i^{id}. The input node features then become

    p_i^{(0)} = \big[ p_i ; p_i^{id} \big]        (8)

Our graph completion objective is closely related to the widely successful BERT [11] method for natural language processing, which predicts missing tokens based on bidirectional context from discrete and sequential text data. We generalize this training objective to work with unordered graphs. Unlike several recent methods (e.g. [25]) that generalize the BERT objective to unordered image patches with pre-computed visual features, our node features are jointly optimized in an end-to-end framework.
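A compact PyTorch sketch of the global interaction layer (Eqs. 4-5) together with the trajectory and node-feature decoders (Eqs. 6-7) follows. The single-linear-layer decoders, the feature dimension and the 30-step horizon are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class GlobalGraph(nn.Module):
    """Single self-attention layer over polyline nodes with simple decoder heads."""
    def __init__(self, dim=128, future_steps=30):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.traj_decoder = nn.Linear(dim, future_steps * 2)   # phi_traj (Eq. 6)
        self.node_decoder = nn.Linear(dim, dim)                # phi_node (Eq. 7)

    def forward(self, p):                        # p: (num_polylines, dim)
        # Fully-connected adjacency: every polyline attends to every other one.
        attn = torch.softmax(self.q(p) @ self.k(p).T, dim=-1)
        return attn @ self.v(p)                  # GNN(P) = softmax(P_Q P_K^T) P_V

    def decode_trajectory(self, node_feat):      # node feature of the target agent
        return self.traj_decoder(node_feat).view(-1, 2)   # per-step 2D offsets

graph = GlobalGraph()
p = torch.randn(20, 128)                         # 20 polyline features
out = graph(p)
future = graph.decode_trajectory(out[0])         # (30, 2) predicted offsets
```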
3.4. Overall framework

Once the hierarchical graph network is constructed, we optimize for the multi-task training objective

    \mathcal{L} = \mathcal{L}_{traj} + \alpha \mathcal{L}_{node}        (9)

where L_traj is the negative Gaussian log-likelihood for the ground truth future trajectories, L_node is the Huber loss between the predicted node features and the ground truth masked node features, and α = 1.0 is a scalar that balances the two loss terms. To avoid trivial solutions for L_node obtained by lowering the magnitude of node features, we L2-normalize the polyline node features before feeding them to the global graph network.

Our predicted trajectories are parameterized as per-step coordinate offsets, starting from the last observed location. We rotate the coordinate system based on the heading of the target vehicle at the last observed location.
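A sketch of the multi-task objective in Eq. (9) is shown below, assuming the trajectory decoder predicts a per-step mean and variance for the Gaussian likelihood; this parameterization, the smooth-L1 form of the Huber loss and the tensor shapes are our own assumptions, since the paper does not spell these details out.

```python
import torch
import torch.nn.functional as F

def vectornet_loss(pred_mean, pred_var, gt_traj, pred_nodes, gt_nodes, alpha=1.0):
    """Eq. (9): Gaussian NLL on the future trajectory plus a Huber-style loss
    on the masked-out polyline node features (only the masked nodes are passed in)."""
    l_traj = F.gaussian_nll_loss(pred_mean, gt_traj, pred_var)
    l_node = F.smooth_l1_loss(pred_nodes, gt_nodes)    # smooth-L1 variant of Huber
    return l_traj + alpha * l_node

# Dummy shapes: 30 future steps with 2D offsets, 5 masked nodes of dimension 64.
pred_mean = torch.randn(30, 2, requires_grad=True)
pred_var = torch.rand(30, 2) + 1e-3                    # variances must be positive
gt_traj = torch.randn(30, 2)
pred_nodes = torch.randn(5, 64, requires_grad=True)
gt_nodes = torch.randn(5, 64)

loss = vectornet_loss(pred_mean, pred_var, gt_traj, pred_nodes, gt_nodes)
loss.backward()                                        # gradients flow to the encoder in the full model
```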
4. Experiments

In this section, we first describe the experimental settings, including the datasets, metrics and the rasterized + ConvNets baseline. Second, we present comprehensive ablation studies for both the rasterized baseline and VectorNet. Third, we compare and discuss the computation cost, including FLOPs and number of parameters. Finally, we compare the performance with state-of-the-art methods.

4.1. Experimental setup

4.1.1 Datasets

We report results on two vehicle behavior prediction benchmarks, the recently released Argoverse dataset [7] and our in-house behavior prediction dataset.

Argoverse motion forecasting [7] is a dataset designed for vehicle behavior prediction with trajectory histories. There are 333K 5-second long sequences split into 211K training, 41K validation and 80K testing sequences. The creators curated this dataset by mining interesting and diverse scenarios, such as yielding for a merging vehicle, crossing an intersection, etc. The trajectories are sampled at 10 Hz, with (0, 2] seconds used as observation and (2, 5] seconds for trajectory prediction. Each sequence has one "interesting" agent whose trajectory is the prediction target. In addition to vehicle trajectories, each sequence is also associated with map information. The future trajectories of the test set are held out. Unless otherwise mentioned, our ablation study reports performance on the validation set.

In-house dataset is a large-scale dataset collected for behavior prediction. It contains HD map data, bounding boxes and tracks obtained with an automatic in-house perception system, and manually labeled vehicle trajectories. The total numbers of vehicle trajectories are 2.2M and 0.55M for the train and test sets, respectively. Each trajectory has a length of 4 seconds, where the (0, 1] second is the history trajectory used as observation, and (1, 4] seconds are the target future trajectories to be evaluated. The trajectories are sampled from real-world vehicles' behaviors, including stationary, going straight, turning, lane change and reversing, and roughly preserve the natural distribution of driving scenarios. For the HD map features, we include lane boundaries, stop/yield signs, crosswalks and speed bumps.

For both datasets, the input history trajectories are derived from automatic perception systems and are thus noisy. Argoverse's future trajectories are also machine generated, while the in-house dataset has manually labeled future trajectories.

4.1.2 Metrics

For evaluation we adopt the widely used Average Displacement Error (ADE), computed over the entire trajectory, and the Displacement Error at time t (DE@ts) metric, where t ∈ {1.0, 2.0, 3.0} seconds. The displacements are measured in meters.
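The sketch below makes these two metrics concrete for a single trajectory sampled every 0.1 s; the helper name and the assumption of a 10 Hz sampling rate are ours, but the definitions follow the text above.

```python
import numpy as np

def displacement_errors(pred, gt, horizon_s=(1.0, 2.0, 3.0), dt=0.1):
    """ADE over the whole horizon and DE@t at selected seconds, in meters.
    pred, gt: (T, 2) arrays of future positions sampled every `dt` seconds."""
    err = np.linalg.norm(pred - gt, axis=-1)            # per-step displacement
    ade = err.mean()
    de_at = {t: err[int(round(t / dt)) - 1] for t in horizon_s}
    return ade, de_at

pred = np.cumsum(np.ones((30, 2)) * 0.10, axis=0)        # dummy 3 s prediction
gt = np.cumsum(np.ones((30, 2)) * 0.12, axis=0)          # dummy ground truth
ade, de = displacement_errors(pred, gt)
print(round(ade, 3), {k: round(v, 3) for k, v in de.items()})
```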
4.1.3 Baseline with rasterized images

We render N consecutive past frames, where N is 10 for the in-house dataset and 20 for the Argoverse dataset. Each frame is a 400×400×3 image, which contains the road map information and the detected object bounding boxes. 400 pixels correspond to 100 meters in the in-house dataset, and to 130 meters in the Argoverse dataset. Rendering is based on the position of the self-driving vehicle in the last observed frame; the self-driving vehicle is placed at the coordinate location (200, 320) in the in-house dataset, and at (200, 200) in the Argoverse dataset. All N frames are stacked together to form a 400×400×3N image as the model input.

Our baseline uses a ConvNet to encode the rasterized images, whose architecture is comparable to IntentNet [5]: we use a ResNet-18 [14] as the ConvNet backbone. Unlike IntentNet, we do not use the LiDAR inputs. To obtain vehicle-centric features, we crop the feature patch around the target vehicle from the convolutional feature map, and average pool over all the spatial locations of the cropped feature map to get a single vehicle feature vector. We empirically observe that using a deeper ResNet model or rotating the cropped features based on target vehicle headings does not lead to better performance. The vehicle features are then fed into a fully connected layer (as used by IntentNet) to predict the future coordinates in parallel. The model is optimized on 8 GPUs with synchronous training. We use the Adam optimizer [17] and decay the learning rate every 5 epochs by a factor of 0.3. We train the model for a total of 25 epochs with an initial learning rate of 0.001.

To test how convolutional receptive fields and feature cropping strategies influence the performance, we conduct an ablation study on the network receptive field, the feature cropping strategy and the input image resolution.
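The sketch below illustrates the vehicle-centric feature pooling described above: crop a small patch around the target vehicle's location on the ConvNet feature map and average pool it into a single feature vector. The helper name, the assumed backbone stride and the border clamping are our own simplifications, not the baseline's exact code.

```python
import torch

def crop_and_pool(feature_map, center_xy, crop=3, stride=8):
    """feature_map: (C, H, W) backbone features; center_xy: target vehicle pixel
    position in the rendered image; crop: patch size in feature-map cells;
    stride: assumed total downsampling factor of the backbone."""
    c, h, w = feature_map.shape
    cx, cy = int(center_xy[0]) // stride, int(center_xy[1]) // stride
    half = crop // 2
    x0, x1 = max(cx - half, 0), min(cx + half + 1, w)    # clamp to the map borders
    y0, y1 = max(cy - half, 0), min(cy + half + 1, h)
    patch = feature_map[:, y0:y1, x0:x1]
    return patch.mean(dim=(1, 2))                        # average pool -> (C,) vehicle feature

feat = torch.randn(512, 50, 50)                          # features of a 400x400 input, stride 8 assumed
vehicle_feature = crop_and_pool(feat, center_xy=(200, 320), crop=3)
print(vehicle_feature.shape)                             # torch.Size([512])
```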
4.1.4 VectorNet with vectorized representations

To ensure a fair comparison, the vectorized representation takes as input the same information as the rasterized representation. Specifically, we extract exactly the same set of map features as when rendering. We also make sure that the visible road feature vectors for a target agent are the same as in the rasterized representation. However, the vectorized representation does enjoy the benefit of incorporating more complex road features which are non-trivial to render.

Unless otherwise mentioned, we use three graph layers for the polyline subgraphs, and one graph layer for the global interaction graph. The number of hidden units in all MLPs is fixed to 64. The MLPs are followed by layer normalization and a ReLU nonlinearity. We normalize the vector coordinates to be centered around the location of the target vehicle at the last observed time step. Similar to the rasterized model, VectorNet is trained on 8 GPUs synchronously with the Adam optimizer. The learning rate is decayed every 5 epochs by a factor of 0.3; we train the model for a total of 25 epochs with an initial learning rate of 0.001.

To understand the impact of the components on the performance of VectorNet, we conduct ablation studies on the type of context information, i.e. whether to use only the map or also the trajectories of other agents, as well as on the impact of the number of graph layers for the polyline subgraphs and global interaction graphs.

4.2. Ablation study for the ConvNet baseline

We conduct ablation studies on the impact of ConvNet receptive fields, feature cropping strategies, and the resolution of the rasterized images.

Impact of receptive fields. As behavior prediction often requires capturing long range road context, the convolutional receptive field can be critical to the prediction quality. We evaluate different variants to see how two key factors of receptive fields, convolutional kernel sizes and feature cropping strategies, affect the prediction performance. The results are shown in Table 1. Comparing kernel sizes 3, 5 and 7 at 400×400 resolution, a larger kernel size leads to a slight performance improvement, but also to a quadratic increase of the computation cost. We also compare different cropping methods, by increasing the crop size or by cropping along the vehicle trajectory at all observed time steps. From the 3rd to 6th rows of Table 1 we can see that a larger crop size (3 vs. 1) significantly improves the performance, and cropping along the observed trajectory also leads to better performance. This observation confirms the importance of receptive fields when rasterized images are used as inputs. It also highlights their limitation: a carefully designed cropping strategy is needed, often at the cost of increased computation.

Impact of rendering resolution. We further vary the resolution of the rasterized images to see how it affects the prediction quality and the computation cost, as shown in the first three rows of Table 1. We test three different resolutions: 400×400 (0.25 meter per pixel), 200×200 (0.5 meter per pixel) and 100×100 (1 meter per pixel). The performance generally increases as the resolution goes up. However, for the Argoverse dataset, increasing the resolution from 200×200 to 400×400 leads to a slight drop in performance, which can be explained by the decrease of the effective receptive field size with the fixed 3×3 kernel. We discuss the impact of these design choices on computation cost in Section 4.4.

Table 1. Impact of receptive field (as controlled by convolutional kernel size and crop strategy) and rendering resolution for the ConvNet baseline. We report DE and ADE (in meters) on both the in-house dataset and the Argoverse dataset.

Resolution | Kernel | Crop | In-house DE@1s / DE@2s / DE@3s / ADE | Argoverse DE@1s / DE@2s / DE@3s / ADE
100×100 | 3×3 | 1×1  | 0.63 / 0.94 / 1.32 / 0.82 | 1.14 / 2.80 / 5.19 / 2.21
200×200 | 3×3 | 1×1  | 0.57 / 0.86 / 1.21 / 0.75 | 1.11 / 2.72 / 4.96 / 2.15
400×400 | 3×3 | 1×1  | 0.55 / 0.82 / 1.16 / 0.72 | 1.12 / 2.72 / 4.94 / 2.16
400×400 | 3×3 | 3×3  | 0.50 / 0.77 / 1.09 / 0.68 | 1.09 / 2.62 / 4.81 / 2.08
400×400 | 3×3 | 5×5  | 0.50 / 0.76 / 1.08 / 0.67 | 1.09 / 2.60 / 4.70 / 2.08
400×400 | 3×3 | traj | 0.47 / 0.71 / 1.00 / 0.63 | 1.05 / 2.48 / 4.49 / 1.96
400×400 | 5×5 | 1×1  | 0.54 / 0.81 / 1.16 / 0.72 | 1.10 / 2.63 / 4.75 / 2.13
400×400 | 7×7 | 1×1  | 0.53 / 0.81 / 1.16 / 0.72 | 1.10 / 2.63 / 4.74 / 2.13

4.3. Ablation study for VectorNet

Impact of input node types. We study whether it is helpful to incorporate both map features and agent trajectories for VectorNet. The first three rows in Table 2 correspond to using only the past trajectory of the target vehicle ("none" context), adding only map polylines ("map"), and finally adding trajectory polylines ("map + agents"). We can clearly observe that adding map information significantly improves the trajectory prediction performance. Incorporating trajectory information further improves the performance.

Impact of node completion loss. The last four rows of Table 2 compare the impact of adding the node completion auxiliary objective. We can see that adding this objective consistently helps with performance, especially at longer time horizons.

Impact of the graph architectures. In Table 3 we study the impact of the depths and widths of the graph layers on the trajectory prediction performance. We observe that for the polyline subgraph three layers gives the best performance, and that for the global graph just one layer is needed. Making the MLPs wider does not lead to better performance, and hurts for Argoverse, presumably because it has a smaller training dataset. Some example visualizations of predicted trajectories and lane attention are shown in Figure 4.

Table 2. Ablation studies for VectorNet with different input node types and training objectives. Here "map" refers to the input vectors from the HD maps, and "agents" refers to the input vectors from the trajectories of non-target vehicles. When "Node Compl." is enabled, the model is trained with the graph completion objective in addition to trajectory prediction. DE and ADE are reported in meters.

Context | Node Compl. | In-house DE@1s / DE@2s / DE@3s / ADE | Argoverse DE@1s / DE@2s / DE@3s / ADE
none         | -   | 0.77 / 0.99 / 1.29 / 0.92 | 1.29 / 2.98 / 5.24 / 2.36
map          | no  | 0.57 / 0.81 / 1.11 / 0.72 | 0.95 / 2.18 / 3.94 / 1.75
map + agents | no  | 0.55 / 0.78 / 1.05 / 0.70 | 0.94 / 2.14 / 3.84 / 1.72
map          | yes | 0.55 / 0.78 / 1.07 / 0.70 | 0.94 / 2.11 / 3.77 / 1.70
map + agents | yes | 0.53 / 0.74 / 1.00 / 0.66 | 0.92 / 2.06 / 3.67 / 1.66

Table 3. Ablation on the depth and width of the polyline subgraph and the global graph. The depth of the polyline subgraph has the biggest impact on DE@3s.

Polyline Subgraph (Depth / Width) | Global Graph (Depth / Width) | DE@3s In-house | DE@3s Argoverse
1 / 64  | 1 / 64  | 1.09 | 3.89
3 / 64  | 1 / 64  | 1.00 | 3.67
3 / 128 | 1 / 64  | 1.00 | 3.93
3 / 64  | 2 / 64  | 0.99 | 3.69
3 / 64  | 2 / 256 | 1.02 | 3.69

Comparison with ConvNets. Finally, we compare our VectorNet with the best ConvNet model in Table 4. For the in-house dataset, our model achieves on par performance with the best ResNet model, while being much more economical in terms of model size and FLOPs. For the Argoverse dataset, our approach significantly outperforms the best ConvNet model, with a 12% reduction in DE@3s. We observe that the in-house dataset contains a lot of stationary vehicles due to its natural distribution of driving scenarios; those cases can be easily solved by ConvNets, which are good at capturing local patterns. However, for the Argoverse dataset, where only "interesting" cases are preserved, VectorNet outperforms the best ConvNet baseline by a large margin, presumably due to its ability to capture long range context information via the hierarchical graph network.

4.4. Comparison of FLOPs and model size

We now compare the FLOPs and model size between ConvNets and VectorNet, and their implications on performance. The results are shown in Table 4. The prediction decoder is not counted in the FLOPs and the number of parameters. We can see that the FLOPs of the ConvNets increase quadratically with the kernel size and the input image size; the number of parameters increases quadratically with the kernel size. As we render the images centered at the self-driving vehicle, the feature map can be reused among multiple targets, so the FLOPs of the backbone part are constant. However, if the rendered images are target-centered, the FLOPs increase linearly with the number of targets. For VectorNet, the FLOPs depend on the number of vector nodes and polylines in the scene. For the in-house dataset, the average number of road map polylines is 17, containing 205 vectors; the average number of road agent polylines is 59, containing 590 vectors. We calculate the FLOPs based on these average numbers. Note that, as we need to re-normalize the vector coordinates and re-compute the VectorNet features for each target, the FLOPs increase linearly with the number of predicted targets (n in Table 4).

Table 4. Model FLOPs and number of parameters for ResNet and VectorNet. R18-kM-cN-rS stands for the ResNet-18 model with kernel size M×M, crop patch size N×N and input resolution S×S. The prediction decoder is not counted in FLOPs and parameters.

Model | FLOPs | #Param | DE@3s In-house | DE@3s Argoverse
R18-k3-c1-r100     | 0.66G     | 246K | 1.32 | 5.19
R18-k3-c1-r200     | 2.64G     | 246K | 1.21 | 4.95
R18-k3-c1-r400     | 10.56G    | 246K | 1.16 | 4.96
R18-k5-c1-r400     | 15.81G    | 509K | 1.16 | 4.75
R18-k7-c1-r400     | 23.67G    | 902K | 1.16 | 4.74
R18-k3-c3-r400     | 10.56G    | 246K | 1.09 | 4.81
R18-k3-c5-r400     | 10.56G    | 246K | 1.08 | 4.70
R18-k3-t-r400      | 10.56G    | 246K | 1.00 | 4.49
VectorNet w/o aux. | 0.041G×n  | 72K  | 1.05 | 3.84
VectorNet w/ aux.  | 0.041G×n  | 72K  | 1.00 | 3.67

Comparing R18-k3-t-r400 (the best model among the ConvNets) with VectorNet, VectorNet significantly outperforms the ConvNets.
For computation, the ConvNets consume 200+ times more FLOPs than VectorNet (10.56G vs. 0.041G) for a single agent; considering that the average number of vehicles in a scene is around 30 (counted from the in-house dataset), the actual computation consumption of VectorNet is still much smaller than that of the ConvNets. At the same time, VectorNet needs only 29% of the parameters of the ConvNets (72K vs. 246K). Based on this comparison, we can see that VectorNet significantly boosts the performance while at the same time dramatically reducing the computation cost.

4.5. Comparison with state-of-the-art methods

Finally, we compare VectorNet with several baseline approaches [7] and some state-of-the-art methods on the Argoverse [7] test set. We report K=1 results (the most likely predictions) in Table 5. The baseline approaches include the constant velocity baseline, nearest neighbor retrieval, and an LSTM encoder-decoder. The state-of-the-art approaches are the winners of the Argoverse Forecasting Challenge. It can be seen that VectorNet improves the state-of-the-art performance from 4.17 to 4.01 on the DE@3s metric when K=1.

Table 5. Trajectory prediction performance on the Argoverse Forecasting test set when the number of sampled trajectories is K=1. Results were retrieved from the Argoverse leaderboard [1] on 03/18/2020.

Model | DE@3s | ADE
Constant Velocity [7]       | 7.89 | 3.53
Nearest Neighbor [7]        | 7.88 | 3.45
LSTM ED [7]                 | 4.95 | 2.15
Challenge Winner: uulm-mrm  | 4.19 | 1.90
Challenge Winner: Jean      | 4.17 | 1.86
VectorNet                   | 4.01 | 1.81

Figure 4. (Left) Visualization of the prediction: lanes are shown in grey, non-target agents are green, the target agent's ground truth trajectory is in pink, and the predicted trajectory is in blue. (Right) Visualization of attention for roads and agents: a brighter red color corresponds to a higher attention score. It can be seen that when agents are facing multiple choices (first two examples), the attention mechanism is able to focus on the correct choices (two right-turn lanes in the second example). The third example is a lane-changing agent; the attended lanes are the current lane and the target lane. In the fourth example, though the prediction is not accurate, the attention still produces a reasonable score on the correct lane.

5. Conclusion and future work

We proposed to represent the HD map and agent dynamics with a vectorized representation. We designed a novel hierarchical graph network, where the first level aggregates information among vectors inside a polyline, and the second level models the higher-order relationships among polylines. Experiments on the large scale in-house dataset and the publicly available Argoverse dataset show that the proposed VectorNet outperforms the ConvNet counterpart while at the same time reducing the computational cost by a large margin. VectorNet also achieves state-of-the-art performance (DE@3s, K=1) on the Argoverse test set. A natural next step is to incorporate the VectorNet encoder with a multi-modal trajectory decoder (e.g. [6, 29]) to generate diverse future trajectories.

Acknowledgement. We want to thank Benjamin Sapp and Yuning Chai for their helpful comments on the paper.

References
[1] Argoverse Motion Forecasting Competition, 2019. https://evalai.cloudcv.org/web/challenges/challenge-page/454/leaderboard/1279.
[2] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, 2016.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[5] Sergio Casas, Wenjie Luo, and Raquel Urtasun. IntentNet: Learning to predict intention from raw sensor data. In CoRL, 2018.
[6] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. In CoRL, 2019.
[7] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3D tracking and forecasting with rich maps. In CVPR, 2019.
[8] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NeurIPS, 2015.
[9] James Colyar and John Halkias. US Highway 101 dataset. FHWA-HRT-07-030, 2007.
[10] Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou, Tsung-Han Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schneider, and Nemanja Djuric. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In ICRA, 2019.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[12] Panna Felsen, Pulkit Agrawal, and Jitendra Malik. What will happen next? Forecasting player moves in sports videos. In ICCV, 2017.
[13] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In CVPR, 2018.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] Joey Hong, Benjamin Sapp, and James Philbin. Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In CVPR, 2019.
[16] Yedid Hoshen. VAIN: Attentional multi-agent predictive modeling. arXiv preprint arXiv:1706.06122, 2017.
[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. In ICML, 2018.
[19] Robert Krajewski, Julian Bock, Laurent Kloeker, and Lutz Eckstein. The highD dataset: A drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems. In ITSC, 2018.
[20] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[21] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
[22] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[23] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. PRECOG: Prediction conditioned on goals in visual multi-agent settings. In ICCV, 2019.
[24] Alexandre Robicquet, Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In ECCV, 2016.
[25] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
[26] Chen Sun, Per Karlsson, Jiajun Wu, Joshua B. Tenenbaum, and Kevin Murphy. Stochastic prediction of multi-agent interactions from partial observations. In ICLR, 2019.
[27] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, 2019.
[28] Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, and Cordelia Schmid. Relational action forecasting. In CVPR, 2019.
[29] Charlie Tang and Russ R. Salakhutdinov. Multiple futures prediction. In NeurIPS, 2019.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[31] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
[32] Raymond A. Yeh, Alexander G. Schwing, Jonathan Huang, and Kevin Murphy. Diverse generation for multi-agent sports games. In CVPR, 2019.
[33] Eric Zhan, Stephan Zheng, Yisong Yue, Long Sha, and Patrick Lucey. Generative multi-agent behavioral cloning. arXiv:1803.07612, 2018.