Spatiotemporal Panel Datasets
We often also need to impute missing data in spatiotemporal datasets. For this example, we’ll use daily temperature values from major cities around the world.
TODO: Give a different workflow/example using a DataFrame.
using AxisKeys, Impute, NamedDims, Plots, Statistics, StatsBase
# So NamedDimsArray is the outer wrapper
AxisKeys.nameouter() = true
# Construct a KeyedArray of our dataset as we want to track gaps (or missing rows)
# in the source CSV data.
data = wrapdims(
Impute.dataset("test/table/temperature"),
:AverageTemperature,
:dt,
:City;
default=missing,
sort=true,
)
# Rename our dims
data = rename(data, :dt => :time, :City => :loc)
2-dimensional NamedDimsArray(KeyedArray(...)) with keys:
↓ time ∈ 3239-element Vector{Dates.Date}
→ loc ∈ 100-element Vector{InlineStrings.String31}
And data, 3239×100 Matrix{Union{Missing, Float64}}:
String31("Abidjan") … String31("Xian")
Date("1743-11-01") missing missing
Date("1743-12-01") missing missing
Date("1744-01-01") missing missing
Date("1744-02-01") missing missing
Date("1744-03-01") missing … missing
Date("1744-04-01") missing missing
⋮ ⋱
Date("2013-03-01") 29.129 10.11
Date("2013-04-01") 28.626 12.563
Date("2013-05-01") 27.652 … 18.979
Date("2013-06-01") 26.157 23.522
Date("2013-07-01") 24.951 25.251
Date("2013-08-01") 24.541 24.528
Date("2013-09-01") missing missing
Okay, so let’s take a look at how much temperature data is missing.
heatmap(ismissing.(data); color=:greys);
So many cities are missing a lot of historical data. A common operation is to remove locations with too many missing historical observations. In our case, we also want to penalize observations closer to the present.
Let’s start by defining a set of exponential weights for our observations:
wv = eweights(1:length(data.time), 0.001)
plot(wv);
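As a quick sanity check on the weights: StatsBase computes eweights(t, λ) as wᵢ = λ(1 − λ)^(1 − i), so the weight grows as the index i increases, i.e. more recent months count for more. Comparing the endpoints makes this concrete (illustrative only):
# The first (oldest) weight is λ itself; the last (most recent) weight
# is the largest in the vector.
wv[1], wv[end]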
Now we want to filter out locations (columns) according to those weights. For now, we’ll say that a location should be removed if its weighted missing-data ratio exceeds 0.1.
data = Impute.filter(data; dims=:cols) do v
mratio = sum(wv[ismissing.(v)]) / sum(wv)
return mratio < 0.1
end
2-dimensional NamedDimsArray(KeyedArray(...)) with keys:
↓ time ∈ 3239-element Vector{Dates.Date}
→ loc ∈ 76-element Vector{InlineStrings.String31}
And data, 3239×76 Matrix{Union{Missing, Float64}}:
String31("Ahmadabad") … String31("Xian")
Date("1743-11-01") missing missing
Date("1743-12-01") missing missing
Date("1744-01-01") missing missing
Date("1744-02-01") missing missing
Date("1744-03-01") missing … missing
Date("1744-04-01") missing missing
⋮ ⋱
Date("2013-03-01") 27.196 10.11
Date("2013-04-01") 30.762 12.563
Date("2013-05-01") 33.856 … 18.979
Date("2013-06-01") 31.099 23.522
Date("2013-07-01") 28.073 25.251
Date("2013-08-01") 27.512 24.528
Date("2013-09-01") missing missing
Okay, so we removed almost 25% of the locations that didn’t meet our missing data requirement. However, our observations from the 1700s are still mostly missing. Let’s remove the rows where more than 50% of the locations are missing.
data = Impute.filter(data; dims=:rows) do v
mratio = count(ismissing, v) / length(v)
return mratio < 0.5
end
2-dimensional NamedDimsArray(KeyedArray(...)) with keys:
↓ time ∈ 2350-element Vector{Dates.Date}
→ loc ∈ 76-element Vector{InlineStrings.String31}
And data, 2350×76 Matrix{Union{Missing, Float64}}:
String31("Ahmadabad") … String31("Xian")
Date("1816-03-01") 24.87 missing
Date("1816-04-01") 29.427 missing
Date("1816-05-01") 32.624 missing
Date("1816-06-01") 30.632 missing
Date("1816-07-01") 26.727 … missing
Date("1816-08-01") 26.131 missing
⋮ ⋱
Date("2013-02-01") 22.726 2.362
Date("2013-03-01") 27.196 10.11
Date("2013-04-01") 30.762 … 12.563
Date("2013-05-01") 33.856 18.979
Date("2013-06-01") 31.099 23.522
Date("2013-07-01") 28.073 25.251
Date("2013-08-01") 27.512 24.528
Now let’s take a look at what data remains.
heatmap(ismissing.(data); color=:greys);
Alright, now we can deal with the remaining missing values. We could try simply imputing the values columnwise for each city using something like Impute.nocb:
heatmap(Impute.nocb(data; dims=:cols));
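As a reminder of what NOCB ("next observation carried backward") does, here is a minimal 1-D sketch; each missing value is filled with the next non-missing observation, and trailing missings are left alone:
# NOCB on a simple vector:
Impute.nocb([1.0, missing, missing, 4.0, missing])
# -> [1.0, 4.0, 4.0, 4.0, missing]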
But this looks rather crude and creates clear artifacts in the dataset. Since we suspect that observations in similar locations would have had similar recordings, we could use Impute.svd or Impute.knn to find similarities across multiple locations.
data = Impute.knn(data; dims=:cols, k=4);
heatmap(data);
This method appears to have removed the artifacts found with the basic NOCB method alone. Now we have a complete dataset ready for downstream processing :)
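As a final sanity check (a simple assertion, not part of the original walkthrough), we can confirm that no missing values remain:
# The KNN-imputed panel should now be fully observed.
count(ismissing, data)  # expect 0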