Trouble Reading From "Large" Point Cloud Array

Using PDAL and the code snippets from the Geospatial section of TileDB’s web page, I was able to successfully load and then read LiDAR data from TileDB. For this simple test, I used a single .laz file. And everything worked great.

To scale things up a bit, I loaded .laz files covering approximately 65 square miles of ground. This is fairly dense LiDAR, and the .laz files take up about 100GB of disk space. Once again, I used code snippets from the Geospatial section – specifically from the Parallel Writes section (I’m using Python). I loaded all of these .laz files into a single array, using the append flag from the second .laz file onward. It understandably took a while to load them all – but the load seemed to complete without error.
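A rough sketch of the load loop (paraphrased – the local .laz directory below is illustrative):

import glob
import json

import pdal

laz_files = sorted(glob.glob("c:/stuff/barry/*.laz"))

for i, laz in enumerate(laz_files):
    writer = {
        "type": "writers.tiledb",
        "array_name": "s3://tiledb-test-bucket/barry_cleaned",
        "chunk_size": 50000
    }
    # The first file creates the array; every file after that appends to it
    if i > 0:
        writer["append"] = True

    # A bare filename as the first stage lets PDAL infer readers.las
    pipeline = pdal.Pipeline(json.dumps([laz, writer]))
    pipeline.execute()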

Once the load was completed, I tried to crop out a small area using a PDAL crop filter. The script never completes – just hangs at pipeline.execute()

pipeline_json = """[
        {
            "type": "readers.tiledb",
            "array_name": "s3://tiledb-test-bucket/barry_cleaned",
        },
        {
            "type": "filters.crop",
            "bounds": "([667567,668686],[3314582,3315601])"
        },
        {
            "type": "writers.las",
            "filename": "c:/stuff/tiledb_croptest.laz"
        }
    ]"""

pipeline = pdal.Pipeline(pipeline_json)
pipeline.validate()
pipeline.loglevel = 8
pipeline.execute()
print("done")

So then, I tried just reading directly from TileDB (no PDAL). This hangs as well.

with tiledb.SparseArray("s3://tiledb-test-bucket/barry_cleaned", mode='r') as A:
    data = A[1:3, 2:5]
    print(data["a"])

Reading the schema of the array does return, so that makes me believe the array is at least valid…
schema = tiledb.ArraySchema.load("s3://tiledb-test-bucket/barry_cleaned")

I’m wondering if there is some practical size a single array should be capped at, or if there are any other obvious things I have done wrong here.

@ferraror I have updated our documentation to reflect https://github.com/PDAL/PDAL/issues/2891. I wonder if this is the issue in your pipeline, since you are calling validate().

I recently ingested a large group of laz files in a similar way to you (about 200 GB) and it was successful so I will do some more debugging into possible issues and create an example we can work through.

@Norman_Barker Thanks for your response and for looking into this. I have removed pipeline.validate() from my script, but it still hangs at pipeline.execute().

Going back through my loading code, it seems I omitted the compression and compression_level options used in the snippet on the TileDB page from my pipeline’s writers.tiledb stage. I did use a chunk_size of 50000:

"compression": "zstd",
"compression_level": 75

Is it possible that omitting compression and compression_level is responsible?
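For reference, combining the chunk_size I used with the compression options from the snippet, the writer stage would have looked roughly like this:

{
    "type": "writers.tiledb",
    "array_name": "s3://tiledb-test-bucket/barry_cleaned",
    "chunk_size": 50000,
    "compression": "zstd",
    "compression_level": 75
}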

@ferraror I see two issues here. The first is with the use of the crop filter; it is more efficient to perform the crop directly in the TileDB reader with a bbox3d argument, e.g.

[
 {
    "type": "readers.tiledb",
    "array_name": "sample",
    "bbox3d": "([-11490800,-11475000],[3860000,3868000],[1025,1180])"
 },
 {
    "type": "writers.las",
    "filename": "cropped.laz"
 }
]
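If you need the Z range for bbox3d, a quick way to get it is the non-empty domain of the array (a tiledb-py sketch, assuming the X/Y/Z dimensions created by the PDAL TileDB writer):

import tiledb

with tiledb.SparseArray("s3://tiledb-test-bucket/barry_cleaned", mode='r') as A:
    # Returns ((xmin, xmax), (ymin, ymax), (zmin, zmax)) for the array's dimensions
    print(A.nonempty_domain())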

However, I am seeing this crash in PDAL 2.0.1 with the dataset I have. We do test with a small sample (https://github.com/PDAL/PDAL/blob/master/plugins/tiledb/test/TileDBReaderTest.cpp#L75), so I will debug this further with a larger dataset. If you can share your data, I will try with that.

Secondly, append mode intentionally does not consolidate the array. Depending on how you are creating the array, you may wish to consolidate prior to any querying: https://docs.tiledb.com/developer/api-usage/consolidating-arrays
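A minimal consolidation call with tiledb-py would be (assuming your S3 credentials are already available to TileDB):

import tiledb

# Merge the many small fragments created by the per-file appends into fewer, larger fragments
tiledb.consolidate("s3://tiledb-test-bucket/barry_cleaned")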

Can you share how many files you are appending to the initial array? I have a simple demo set up with public data to load laz files into an existing array and I can test out your use case if you cannot share your data. We are also working on improvements within the PDAL TileDB driver for this.

@Norman_Barker I’ll see what I can do about sharing the input .laz files with you. Will probably take 1-2 days to get the required permission for me to do so… I can tell you I loaded 772 .laz files into a single TileDB array.

I have started a consolidate operation. Since my TileDB array is currently in S3 and the documentation says to avoid any reads while consolidation is in progress, I will try bbox3d (vs. crop) once the consolidation is complete and let you know how that goes.

We are in the process of improving consolidation. In the future, you will not need to do this manually or avoid reading while consolidating. This change will probably make it to production by early April.

Please let us know if you experience any issues with the current consolidation algorithm. There are various configuration parameters you can tweak to improve performance.

Just to speed things up some, I copied the TileDB array from S3 to local storage and then fired off the consolidation. It looks like it ran out of memory.

Consolidating locally is a good idea. I would create a configuration object and set the following configs:

"sm.consolidation.buffer_size" = 10000000 (set internal buffer per attribute to ~10MB)
"sm.consolidation.step_max_frags" = 10 (set number of fragments to consolidate per step to 10)

This will hierarchically consolidate the fragments, 10 at a time (always consolidating smaller fragments first, then larger), until all fragments are consolidated to 1. I would start by tweaking these two parameters.

See an example of setting a config for consolidation here.

All the configuration parameters are included here.


I’m still running into what I believe to be a memory error. Per your suggestion, I tried running a consolidation with the following configuration:

config = tiledb.Config()
config['sm.consolidation.buffer_size']=10000000
config['sm.consolidation.step_max_frags']=10
ctx = tiledb.Ctx(config)
tiledb.consolidate('/home/ubuntu/barry_cleansed',ctx=ctx)

When I executed this, I got this error:

Invalid configuration; Minimum fragments config parameter is larger than the maximum

Looking through the configuration parameters documentation you linked to, I see that the defaults for both sm.consolidation.step_max_frags and sm.consolidation.step_min_frags are 4294967295. Since I set step_max_frags to 10, I updated my script to set step_min_frags to 10 as well:

config = tiledb.Config()
config['sm.consolidation.buffer_size']=10000000
config['sm.consolidation.step_max_frags']=10
config['sm.consolidation.step_min_frags']=10
ctx = tiledb.Ctx(config)
tiledb.consolidate('/home/ubuntu/barry_cleansed',ctx=ctx)

When I made this change, the consolidate operation ran for a while (maybe 30 min to 1 hr) before dying with no error message: just 'Killed' written to the console, which makes me believe it ran out of memory…

Is there a way for us to get access to the data (we can also sign an NDA if needed)?

In the meantime, could you please test with max/min frags equal to 5, and "sm.consolidation.steps" = 1? This will consolidate exactly 5 fragments. Then, please check whether the total number of fragments was reduced by 4 (five old ones removed, one new one added).
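A sketch of that targeted test (the fragment count check uses the tiledb-py fragment info API, which needs a reasonably recent version):

import tiledb

uri = '/home/ubuntu/barry_cleansed'

config = tiledb.Config()
config['sm.consolidation.buffer_size'] = 10000000
config['sm.consolidation.steps'] = 1            # run exactly one consolidation step
config['sm.consolidation.step_min_frags'] = 5   # consolidate exactly 5 fragments in that step
config['sm.consolidation.step_max_frags'] = 5
ctx = tiledb.Ctx(config)

before = len(tiledb.array_fragments(uri, ctx=ctx))
tiledb.consolidate(uri, ctx=ctx)
after = len(tiledb.array_fragments(uri, ctx=ctx))

# Expect roughly before - 4: five old fragments replaced by one new fragment
# (on newer TileDB versions the old fragments may remain until a vacuum)
print(before, after)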

Stavros