Tuesday, June 4, 2024

Oracle Spatial - Millions of Points in Thousands of Polygons

I wrote about Oracle Spatial in the past. (Intro, data loading Points, Lines, Polygons). Now I had a new challenge.

We had a POC. Customer wanted to test performance of 7 million data points in 3200 polygons. In the POC I learned there is difference in performance, based on the way we create the query. I was lucky to gain some guidance from Daniel Geringer (who wrote this guide) of Oracle Spatial. Bellow is the summary of the process. It was done on Autonomous Database (ADW), but should be similar on any Oracle DB.

First part will cover the performance benefits from using the proper function. Second, some advisable data setting and cleaning steps we did. 


Part 1 

Customer loaded ~7 million data points (lng/lat) of USA accidents from a similar data source and loaded to ADW. Loaded ~3200 US counties from this or similar source

We could update the longitude and latitude (lng/lat) data and place them in a new spatial format column. For many reasons we preferred not to.  We worked directly with the lng/lat points and the get_geometry function (see page 25 here or slightly longer version here). Next created index (see pages 26, 27 here)... and fixed some inconsistencies (see second part, bellow).

My default intention was to use the SDO_CONTAINS function. Instead of SDO_CONTAINS, Daniel suggested top use  SDO_ANYINTERACT or SDO_INSIDE instead and to switch the order of the geometry parameter to the spatial operator. 

SDO_ANYINTERACT can be used if it's ok to return points on the polygon boundary.  If the requirement is only to return points inside the polygon, SDO_INSIDE should be used. In switching the geometry order, there is a huge difference between running 3200 times comparisons for 7 million points and running 7 million times comparisons for 3200 polygons. In our tests it was ~390 seconds vs. ~5 seconds.

In all cases the following format was used:

select count(*) from accident a, counties c where SDO_... ='TRUE';

Let's compare the following:

  • SDO_CONTAINS(C.GEOMETRY, GET_GEOMETRY(LNG,LAT)) = 'TRUE'
  • SDO_ANYINTERACT(GET_GEOMETRY(LNG,LAT),c.geometry)='TRUE'
  • SDO_ANYINTERACT(c.geometry,GET_GEOMETRY(LNG,LAT))='TRUE'
  • SDO_INSIDE(GET_GEOMETRY(LNG,LAT),c.geometry)='TRUE'

In all cases we had level 8 of parallel execution:

  • SDO_CONTAINS(C.GEOMETRY, GET_GEOMETRY(LNG,LAT)) = 'TRUE' returned in 387.7 sec.
  • SDO_ANYINTERACT(GET_GEOMETRY(LNG,LAT),c.geometry)='TRUE' returned in 4.8 sec.
  • SDO_ANYINTERACT(c.geometry,GET_GEOMETRY(LNG,LAT))='TRUE' returned in 389.2 sec.
  • SDO_INSIDE(GET_GEOMETRY(LNG,LAT),c.geometry)='TRUE' returned in 5.0 sec.

This is 80 times faster!

Both SDO_INSIDE and SDO_CONTAINS return the same number (6,973,650) while both SDO_ANYINERACT return slightly higher number (6,973,846).


So we have 2 winners that can use the better parameters order. 

I will focus on this option:

select count(*) 
from accident a, counties c
where SDO_ANYINTERACT(GET_GEOMETRY(LNG,LAT),c.geometry)='TRUE';


I wanted to see how does the parallelism influence this query performance. I rerun it with various combinations using ADW High connection (bellow 8 ECPU explain plan): 



The results are impressive:



What happens if we increase the amount of data? 

I duplicated the original data 21 times, while moving each data points randomly by 10-100 meters each time towards the center of USA.  I didn't want duplicated locations for the test. 

The result was a table of over 146 million locations. I used similar query and tested it with various amounts of ECPU. Here are the results:







Part 2 

For better performance , before execution, it is advisable to:

  • Validate geometry (fix if needed)
  • Create Index

We validated the Counties table

select name,  sdo_geom.validate_geometry_with_context(GEOMETRY, 0.005) status
from  COUNTIES

and received lots of errors of the type:  13367 [Element <1>] [Ring <1>]

Daniel explained that it is probably due to polygon directionality. It was solved by running the sdo_util.rectify_geometry like this:

update counties set geometry=sdo_util.rectify_geometry(geometry,0.005);

Good code for similar problems is in the blog post by Hans Viehmann here


See also the Spatial Data Validation chapter (slides 3-8) in the  Oracle Spatial Best Practices and Tuning Tips by Daniel Geringer here.


Last part was the index creation. While I covered index creation for  PointsLinesPolygons earlier, I learned there is performance benefit in creating CBTREE organized index for streaming point data. See slides 29, 30 here

Our code was:

INSERT INTO USER_SDO_GEOM_METADATA VALUES (
  'ACCIDENTS',  -- table name
  'ADMIN.GET_GEOMETRY(LNG,LAT)',    -- geometry column name
  SDO_DIM_ARRAY(
    SDO_DIM_ELEMENT('X', -180, 180, 0.005),  -- 'X',[min x],[max x],[tolerance]
    SDO_DIM_ELEMENT('Y', -90, 90, 0.005)),    -- 'Y',[min y],[max y],[tolerance]
   4326           -- indicates longitude/latitude coordinates (coordinate system)
 );


CREATE INDEX ACCIDENTS ON ACCIDENTS_2023_03 ( ADMIN.GET_GEOMETRY(LNG,LAT) )
INDEXTYPE IS MDSYS.SPATIAL_INDEX_V2 PARAMETERS('layer_gtype=point cbtree_index=true');

Please note the INDEXTYPE part that is different from the regular: 
INDEXTYPE IS MDSYS.SPATIAL_INDEX_V2;


P.S. 

One more interesting option to test is the SDO_POINTINPOLYGON function that was described Jean Ihm here and Karin Patange here