ZTRAX is subjected to quality issues that include spatial, temporal, and thematic uncertainties that propagate into the gridded surfaces contained in the HISDAC-US. In part, these uncertainties have been quantified in previous work23,24. For a thorough positional accuracy assessment of the gridded surfaces in HISDAC-US, over time and across the rural-urban continuum, we direct the reader to Uhl et al. (2021a), who report regionally varying levels of positional agreement. Uhl et al. (2021a) provide important insights into the quantity agreement of the ZTRAX-derived grid cell aggregates of built-up records and locations, as compared to building footprint data, census population and housing unit counts. Such reported disagreements also propagate into the land use data layers described herein, and the user of any data products from the HISDAC-US is urged to refer to these validation results to reflect the accuracy of the data appropriately. Based on this validation, it is known that while the completeness in HISDAC-US is acceptable for data layers after 1900, all products derived from ZTRAX will be subject to underestimation due to the difficulties of obtaining structural records from counties that have differing reporting policies, attribute incompleteness and inconsistency, and the dynamic nature of development. The level of underestimation for residential records was assessed in Uhl et al. (2021a) who reported varying levels of incompleteness of records along the rural-urban continuum in comparison to Census housing unit counts.
Herein, we assessed the completeness of land use and year-built attributes in ZTRAX (Section 4.1) and employed three ancillary datasets to quantitatively and qualitatively address uncertainties specific to the land use product. Specifically, we used land use data from volunteered geographic information (i.e., OpenStreetMap, OSM) (https://planet.openstreetmap.org) to assess the agreement with the created (contemporary) land use layers (Section 4.2), and compared our land use layers to remote-sensing-derived land cover/land use (LULC) data from the National Land Cover Database 200169 and 201670, as well as to urban land use classes from the Local Climate Zones (LCZ)71 dataset available for the CONUS (Section 4.3). In addition to that, we used data on building demolitions to quantify effects of survivorship bias, as building replacements or teardowns are not recorded in ZTRAX (Section 4.4). Finally, we used overhead imagery and a visual-analytical approach to assess the visual consistency of buildings at ZTRAX locations for different land use categories (Section 4.5).
Grid cells with small structural counts and low attribute completeness should be carefully considered as cell value assignment in the main data product was based on the most frequently occurring land use class within the grid cell extent. In such cases, the user is advised to use the land use type count layers in conjunction with uncertainty layers to better understand the underlying reliability of the data. Table 3 summarizes attribute missingness statistics for the land use data product, indicating that over 98% of the ZTRAX records have valid land use information (lower levels are found e.g. in Maine or Iowa, see Fig. 2a), and over 75% of ZTRAX records have valid land use and year built information. The majority of ZTRAX records have valid location information (Fig. 2b). Around 25% of the data are lacking land use and year-built information, and these are located in approximately 400 counties (see Fig. 2c), that can be also identified by the county-level completeness layers (Section 3.2).
Comparison to OpenStreetMap land use data
While detailed and reliable land use data is sparse, OpenStreetMap (OSM) offers user-generated land use and functional information at the building level. While OSM is not expected to have high completeness in terms of the land use attribute, we assume the reported land use information to be accurate. We generated gridded surfaces, aligned with the HISDAC-US land use data grids, containing the number of buildings of a given land use type in OSM per grid cell, and conducted a cell-level agreement assessment. We mapped the relevant OSM land use types to the land use classification scheme of the presented HISDAC-US land use data. Moreover, due to the sparsity of some land use classes in both datasets, and the potentially large bias introduced by this, we only evaluated the three most frequent land use classes: residential, commercial, and industrial.
Preliminary tests have shown that a considerable amount of building footprints in OSM are lacking the land use attribute and thus, its completeness in OSM appears to be low in certain regions of the CONUS, while the correctness of those attributes that exist is expected to be high. Thus, only Type II errors (i.e., comission errors) in the ZTRAX-derived land use data can be quantified by comparing against the OSM data. For the evaluation of commission error, please refer to Section 4.3 (comparison to remotely-sensed LULC data) and Section 4.4 (Survivorship bias). Note that this assessment was done for the most recent point in time of the presented land use layer series (i.e., 2016) and for contemporary OSM data downloaded in 2021, to keep the temporal gap to a minimum. We carried out these assessments for individual counties as well as across the rural-urban continuum using the Rural-Urban Continuum Codes (RUCCs) created by the U.S. Department of Agriculture (USDA)72,73. RUCCs define nine rural-urban classes, including three metro and six non-metropolitan county designations using criteria of population size, the degree of urbanization and adjacency to a metro area (Table 4).
Given these constraints in the OSM reference data, we first extracted all grid cells containing at least one OSM and ZTRAX derived record of the same land use class and assessed the correlations of the grid cell counts, as a measure of quantity agreement. Due to the ZTRAX data structure and spatial generalization effects, these distributions can contain outliers, resulting from large numbers of records in individual grid cells24, and thus, we used Spearman’s rank correlation coefficient for this assessment. Moreover, we calculated the recall (i.e., producer’s accuracy, or sensitivity) of the ZTRAX-derived land use counts with respect to the OSM in order to quantify the omission errors associated with the ZTRAX-derived data. The latter was done based on binarized absence-presence gridded surfaces, using a threshold of at least one record per grid cell, and thus allowing for measuring the Type II error component of the positional agreement between the ZTRAX and OSM derived surfaces. We quantified both, correlations of grid-cell level counts and the spatial agreement (i.e., recall) for each of the three land use classes under test for the whole CONUS, and across the rural-urban continuum, by conducting stratified assessments for grid cells located in counties of each RUCC (Table 4), as well as for each individual county (Fig. 3).
First, we observe positive correlations between building counts of ZTRAX and OSM land use classes across CONUS (>0.33 across all RUCCs for any class), and these correlations are highest for the residential class in highly urban environments (i.e., RUCC 1, c = 0.66). Correlations generally decrease towards more rural settings, where both ZTRAX and OSM completeness can be low. The completeness of ZTRAX land use records appears to decrease from the residential to the commercial and industrial classes, yielding recall values over all RUCCs of up to 0.77, 0.61 and 0.34, respectively. Recall values across the RUC follow similar patterns as the correlations, exhibiting highest values for the residential class in urban settings (RUCC 1, recall = 0.88) and lowest values in rural settings for the industrial class.
While these general patterns illustrate the broad-scale agreement between ZTRAX and OSM based land use data, we observe strong local variations of uncertainty at the county level, as the distributions of Spearman’s rank correlation coefficients and recall measures calculated at the county level suggest (Fig. 3a,b). As the upper tails of these distributions indicate, there is a considerable number of counties that exhibit very high quantity and positional Type II agreement between ZTRAX and OSM. Decomposing the distributions of county-level agreement metrics across the rural-urban continuum, we observe that while the overall agreement metrics in Table 4 decrease from urban towards rural regions, this trend is less visible in the county-level metrics (Fig. 3c,d). For example, a considerable number of rural counties (RUCC 6–9) exhibit high recall values for residential and commercial land use classes. We would like to emphasize that different factors such as spatial, temporal, and semantic inconsistencies between ZTRAX and OSM data, as well as the user-generated nature of the OSM database and associated uncertainty issues affect the presented agreement assessment results, underlining the difficulty in conducting land use data validation in general. However, these results suggest that the surfaces representing contemporary land use are largely coherent to the independently collected and compiled OSM data and thus, represent a reliable and plausible proxy for land use distributions in most regions of the CONUS.
Comparison to remote-sensing-based LULC datasets
We used gridded land cover data from the NLCD in 200169 and 201670, as well as gridded LCZ urban land use (temporally referenced approximately in 2016–2018) available for the CONUS71. We implemented two approaches for this comparison: First, we implemented a record-based approach: We drew a stratified random sample of ZTRAX records, retrieved the land use/climate zone from the underlying NLCD and LCZ grids at the location of each record, and cross-tabulated the land use class of each ZTRAX record and the respective NLCD and LCZ class labels found at the respective location. Specifically, we randomly selected one county for each of the nine RUCCs, within each of the nine U.S. census divisions72,73. We then retrieved the ZTRAX property records within these counties and drew a sample of n = 1,000 records (with replacement) from each of our six land use classes (cf. Table 1) per county. This way, we obtained a sample of N = 486,000 ZTRAX records, located within 81 U.S. counties uniformly distributed across the CONUS, and across the rural-urban continuum, and equally proportioned across the land use classes used herein.
Second, we implemented a raster-based approach: We down-sampled the NLCD gridded surfaces from their native resolution of 30 m and the LCZ data from 100 m into the HISDAC 250 m grid using two resampling techniques: 1) a majority-area rule, and 2) using 1-hot encoding, i.e., creating a binary 250 m gridded surface for each NLCD and LCZ class, encoding the presence of each class with 1, and the absence with 0. This way, we were able to evaluate the correspondence of our land use classes also to underrepresented classes in NLCD and LCZ, which are likely to disappear when using majority-area resampling. Similarly, we created a binary surface in our 250 m grid indicating the presence (1) or absence (0) of records of any of our six land use classes, based on the land-use-specific property count surfaces (cf. Fig. 2). We compared these data layers by cross-tabulating our land use based binary surface with the binary surfaces of each of the LULC classes from NLCD and LCZ, respectively.
We compared NLCD 2016 and LCZ to our 2016 layers, and to minimize the effects of temporal inconsistencies, we compared the NLCD 2001 to our layers referenced in the year 2000. These different strategies (record-based and raster-based) allowed for gaining a relatively unbiased picture of the correspondence between our land use classes and remotely sensed LULC types. The record-based approach evaluates the correspondence between ZTRAX and the LULC datasets without being affected by additional uncertainty induced by the resampling. However, it only evaluates thematic agreement where ZTRAX records are available, disregarding omission errors. The raster-based approach may suffer from additional positional uncertainty due to resampling from 30 m (NLCD) and 100 m (LCZ) resolutions to the target resolution of 250 m but enables the quantification of class-specific omission errors in regions where no ZTRAX records are available.
The record-based comparison (Fig. 4, top part) revealed very similar patterns for NLCD 2001 and NLCD 2016: Highest proportions of income and owned residential ZTRAX records are located in the NLCD classes “Developed, low intensity” and “Developed, medium intensity”, whereas industrial and commercial land uses have highest proportions in “Developed, high intensity”, in particular in urban counties. Agriculturally used properties have highest proportions within “Pasture/Hay” and “Cultivated crops”. Comparing to the local climate zones (Fig. 4, bottom part) shows that the highest proportions of ZTRAX records for most land use classes are located within the “Open low-rise” class, except for the agricultural land use class, which peaks in the “Low plants” and “Dense trees” LCZ classes.
Some of these cross-tabulations seem implausible, such as ZTRAX records located in wetlands or open water. It is likely that these are artefacts due to the resampling, and the spatial resolution of the HISDAC-US land use data layers. In the least optimistic scenario, we can consider these mismatches to be commission errors (i.e., ZTRAX reports built-up structures that do not exist). In that case, these commission errors quantifiable by the conducted cross-comparison would sum up to only 4–5% of all ZTRAX records. Here, it is worth noting that commission errors in ZTRAX may also occur due to demolished buildings that have not been deleted or updated (i.e., set to “vacant” land use) in ZTRAX. However, these cases are likely not to exceed 1–2% of all ZTRAX records (see Section 4.4).
The raster-based approach reveals a complementary picture. As shown in Table 5, for most non-settlement and vegetation-dominated NLCD classes, only small area proportions are covered by grid cells containing one or more ZTRAX records. This trend inverts for the settlement-related land cover classes: For example, over 82% of “Developed, low intensity” land cover in 2016 geographically coincides with the land use data described herein. This agreement is lower for the NLCD 2001 data, as grid cells without temporal information are counted as “% not covered by HISDAC”. This trend persists across the two different data resampling techniques. Larger differences in these proportions between majority-based resampling and 1-hot encoding indicate that the land cover classes (e.g., “Developed, low intensity”) are underrepresented and/or spatially scattered and thus, disappear when using majority-based resampling. Moreover, when distinguishing these cross-tabulations in proportions of the HISDAC-covered and not covered area (Table 5, bottom part), we observed that the highest proportion of not covered area is shrubland/scrub (i.e., 23%), and highest proportions of the HISDAC-covered areas are located in “Deciduous Forest” and “Pasture/Hay” (agricultural class, cf. Figure 4), followed by the developed classes.
The raster-based cross-tabulations with LCZ classes show a similar pattern: The majority of the settlement-related and built-up classes (e.g., compact and open high-rise, etc.) are covered by HISDAC-US (Table 6, left part), whereas most vegetation-dominated LCZ classes are not covered. However, we observed some exceptions deviating from this trend: For example, the “Open midrise” class is mostly not covered in HISDAC-US. A reason could be public buildings that are omitted in HISDAC-US24 and are not considered in the land use data presented herein. Moreover, only 10–12% of the grid cells labelled as “Heavy industry” are covered in HISDAC-US. This may be caused by spatial offsets, as industrially used parcels may be very large, but also indicates a relatively poor coverage of industrial land use, which is also in line with observations made when comparing our data to OSM (Section 4.2). Conversely, the highest proportions of HISDAC-US-covered area in LCZ is classified as “Open low-rise”, “Dense trees” or “Low plants”. This is plausible as dense, urban settlements only represent a small portion of the US built environment, and peri-urban and rural settlements as well as agriculturally used structures are typically spatially scattered and thus, as a result of the resampling process, “occupy” a larger proportion of grid cells than built-up properties in dense, urban settings.
It is worth noting that due to the different properties of the data compared herein (i.e., discrete locations vs. categorical and density information contained in gridded surfaces), positional uncertainty in both the LULC data (e.g., induced by the registration accuracy of the underlying remote sensing data) and in the ZTRAX data (e.g., using parcel centroids or address points instead of the locations of actual built-up structures) may introduce additional uncertainty in these cross-comparisons. However, the aggregation of the NLCD and LCZ datasets from fine resolutions to the target resolution of 250 m is assumed to mitigate such bias partially.
Assessing survivorship bias in the historical data
Survivorship bias presents a problem that appears in several disciplines and is of particular importance to most types of settlement, land use, or building stock data74,75,76,77,78. This type of bias appears when units, such as a built structure, are removed from the population but are not accounted for in the data. For example, a structure built in 1930 may get remodelled or may get demolished over time79. ZTRAX does not directly account for demolished structures and therefore does not continue to represent structures that no longer exist. The described land use data product suffers from the same limitation in that only surviving buildings are considered without accounting for possible structural losses. To demonstrate and measure the effects of this survivorship bias for Colorado, we used address-level demolition data over 10 years (2008–2017) obtained from the Colorado State archives (https://spl.cde.state.co.us/artemis/heserials/he171017internet/).
We stratified the counties in Colorado by their RUCC, and found that demolitions took place in urban counties at more than 10 times the rate of demolitions in rural counties; out of 28,403 possible demolitions, 26,011 occurred in rather urban counties (i.e., RUCC designations 1–5). We grouped RUCC 1–5 as urban counties and RUCC 6–9 as rural for all analysis using RUC codes. Comparing the total amount of demolitions occurring between 2008 and 2017 to the total number of built structures in 2015, we estimate that approximately 1.1% of Colorado’s building stock was demolished during this time period (thus an average annual rate of 0.11%). At the county scale we found, for both rural and urban counties, that the maximum percentage of demolished building stock did not exceed 2.5% during the 10-year period. As mentioned before, this observation also provides an estimated upper bound of potential commission error in ZTRAX and the derived land use datasets: There may be cases where demolished buildings are not reconstructed, and the demolition is not updated in ZTRAX, leading to a false positive (i.e., commission error). Furthermore, we refer to Uhl et al. (2021) where commission errors of ZTRAX-based settlement layers were quantified, and high levels of precision were observed in contemporary, urban settings, dropping to around 0.7 in rural settings and early time periods.
Moreover, we matched the demolition records to the ZTRAX records based on the address information given in both datasets and assessed the relationships between the demolition year and the year built on record in ZTRAX, separately for urban and rural counties (Fig. 5). The scraped demolition data contained a total of 33,645 addresses of which we were able to match 28,403 records to the ZTRAX data, leaving 5,242 records unmatched. These unmatched records may represent structures that have disappeared completely and are not contained in the ZTRAX data, or they are an artefact of our matching process, which was address-based and thus may be prone to misspelling errors in the addresses. We noted multiple instances in the scraped demolition data that were incorrectly spelled or had inconsistent formatting. From the analysis of this data we observed the following: (1) Most buildings that were demolished do not have a valid year-built attribute, and this proportion is higher in rural than in urban counties. This indicates that missing year-built attributes may be a result of recent teardowns, and possibly reconstruction, and a reporting latency that appears to be higher in rural counties. (2) Only a small percentage of demolished buildings (around 10% in 2008) have a year built > = year demolished. These records represent rebuilding activity that likely caused a replacement of the prior year built and represent the survivorship bias in ZTRAX- derived building age information. (3) Around 20% in 2008, and 50% in 2016 of demolitions did not cause an update of the year built on record in ZTRAX (year built <year demolished). This could imply several things: (a) The buildings were demolished and not replaced (i.e., they “disappeared”), but data records were not updated. This would illustrate an important limitation of our data, i.e., the shrinkage of human settlements cannot be measured. (b) The buildings were demolished and replaced, but the year built was not updated. This scenario would reduce the survivorship bias with respect to building age (i.e., the “original” year built persists); and (c) The buildings were demolished and replaced, and in addition to that, the building function changed: This would be an example of historical land use change not captured in our data. Furthermore, only a very small portion of demolished buildings is labelled as “vacant” in ZTRAX, indicating that most demolitions are followed by immediate reconstruction, or these cases are underreported in ZTRAX, which again would be an example of the inability to capture the shrinkage of built-up land in ZTRAX.
In conclusion, while we acknowledge the uncertainty due to survivorship bias contained in our data and generated by our modelling approach, it is clear that even in the most conservative scenarios that use verifiable data, survivorship bias would have minimal impact on analytical outcomes. As described above, we assume that land use for a structure was designated at the time the structure was built as this would have been the time that construction records/permits were submitted to the county assessor. Thus, some uncertainty remains unaddressed if buildings were built in parcels that have been re-zoned, and thus their land use designation may have changed, at some point in time. However, different kinds of land use changes over time have different transition likelihoods. We illustrate this in Table 7 to provide a basis for identifying land use classes that may be prone to this type of thematic uncertainty if past land use changes have not been recorded in the database.
Qualitative comparison to overhead imagery
Lastly, we used Bing aerial imagery (https://www.arcgis.com/home/item.html?id=8651e4d585654f6b955564efe44d04e5) to qualitatively assess the relationship between broad-scale patterns of land surface in remotely sensed earth observation data and the different land use classes recorded in ZTRAX. To do so, we randomly selected one location per land use class, for each of the 3,019 covered counties, resulting in a total of 18,114 sample locations. We then obtained the RGB Bing imagery within a bounding box of 100 × 100 meters around each ZTRAX location, and generated a mosaic of these individual images, per land use class. These mosaics are based on a method proposed and employed in Uhl et al.80 which involves the calculation of color moments81 for each image, resulting in a 12-dimensional low-level descriptor summarizing the color content of each image. We then use t-distributed stochastic neighbour embedding (T-SNE)82 to map these 12-dimensional descriptors into a bi-dimensional space. T-SNE arranges data points in the bi-dimensional target space in a way that similar data points are located near each other. We then rectified the resulting 2-d point cloud and visualized each image at its corresponding location in t-SNE space. This method groups similar images together and allows for an integrated visual assessment of large amounts of images (Fig. 6). These visualizations characterize the broad-scale patterns of geographic contexts encountered at the ZTRAX locations per land use class. For example, they illustrate the quantity of vegetation-dominated settings, which are most frequent in the agricultural land use class. Small buildings are commonly found in the agricultural and residential land use classes. Large bright objects represent the (typically flat) roofs of large industrially, commercially, or recreationally used structures and seem to occur commonly at locations of these three land use classes. Note that images containing vegetation only are likely seen due to positional offsets of ZTRAX point locations from the actual building locations within rural (often larger) cadastral parcels50. However, we can assume that these offsets have a minor effect on the data accuracy due to the chosen spatial resolution of 250 × 250 m, as recent multi- scale accuracy assessments have suggested24. Thus, the visual inspection of the t-SNE plots in Fig. 6 reveals plausible matches for most sample locations. This technique could be used to systematically refine larger-scale samples for building level verification to conduct quantitative accuracy assessments, as far as building function can be inferred from overhead imagery.
In this analytical effort, we used OpenStreetMap, demolition data records, remote-sensing derived land cover classifications and overhead imagery as comparative data sources, being aware that none of these external data sources represent optimal ground truth to evaluate the quality of the created land use layers. While these comparisons do not quantify the uncertainty in historical land use data, they highlight important data quality aspects and properties that help to better understand the completeness and inherent bias in the data product.