[ ]:
import logging
from pystatis import Table
logging.basicConfig(level=logging.INFO)
The Table Class
The Table class in pystatis is the main interface for users to interact with the different databases and download the data/tables in the form of pandas DataFrames.
To use the class, you only have to pass a single parameter: the name of the table you want to download.
[ ]:
t = Table(name="81000-0001")
Downloading data
However, creating a new Table instance does not automatically retrieve the data from the database (or cache). Instead, you have to call another method: get_data(). The reason for this decision is to give you full control over the download process and to avoid unnecessary downloads of big tables unless you are certain you want to start the download.
[ ]:
t.get_data()
You can access the name of a table via the .name attribute.
[ ]:
t.name
After a successful download (or cache retrieval), you can always access the raw data, that is, the original response from the web API as a string, via the .raw_data attribute.
[ ]:
print(t.raw_data)
More likely, you are interested in the pandas DataFrame, which is accessible via the .data attribute.
[ ]:
t.data.head()
Finally, you can also access the metadata for this table via the .metadata attribute.
[ ]:
from pprint import pprint
pprint(t.metadata)
How pystatis prepares the data for you
As you can notice from a comparison of the .raw_data and .data formats, pystatis does a lot behind the scenes to provide you with a format that is hopefully the most useful one for you. You will see and learn that there are a few parameters you can use to change this behavior and adjust the table to your needs.
But first we would like to explain how pystatis prepares the data by default, so you have a better understanding of the underlying process.
When we look at the header of the raw data, we can notice a few things:
- Many columns always come in a pair of *_Code and *_Label columns. Both contain the same information, only provided differently.
- There are columns that don't have a direct use as they contain information not needed in the table, like the Statistik_Code and Statistik_Label columns at the beginning. You already know the statistic from the name of the table, and this information is the same for each and every row anyway.
- There is always a time dimension, broken down into three different columns: Zeit_Code, Zeit_Label and Zeit (or time_* in English).
- The other dimensions are called variables (German "Merkmale") and they always come in groups of four columns: N_Merkmal_Code, N_Merkmal_Label, N_Auspraegung_Code, and N_Auspraegung_Label (English: variable code and label, and variable value code and label).
- The actual measurements or values come at the end of the table, after the variables, and each measurement has one column. The name of this column follows the format <CODE>__<LABEL>__<UNIT>, e.g. "BWS001__Bruttowertschoepfung__jew._ME". "BWS001" is the unique code for this variable, "Bruttowertschoepfung" is the human-readable label of the variable, and "jew._ME" is the unit the measurement was recorded in.
Note: this is only true for tables from Genesis and Regionalstatistik; the format of the Zensus tables is noticeably different. However, we follow a similar approach to provide you with the same convenient output format.
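The measurement column naming can be illustrated with a tiny sketch (illustrative only, not pystatis internals): splitting such a column name on the double underscore recovers its three parts.

```python
# Illustrative sketch (not pystatis internals): a raw measurement column
# name follows the pattern <CODE>__<LABEL>__<UNIT>, so splitting on the
# double underscore recovers code, label and unit.
raw_column = "BWS001__Bruttowertschoepfung__jew._ME"

code, label, unit = raw_column.split("__")
print(code, label, unit)  # BWS001 Bruttowertschoepfung jew._ME
```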
The following table hopefully makes it a little clearer what happens when going from the raw data string to the pandas DataFrame. The example shows table "11111-02-01-4" from Regionalstatistik, but remember that Genesis and Regionalstatistik have identical formats. The table has a time dimension, one variable and one value.
Statistik_Code | Statistik_Label | Zeit_Code | Zeit_Label | Zeit | 1_Merkmal_Code | 1_Merkmal_Label | 1_Auspraegung_Code | 1_Auspraegung_Label | GEM001__Zahl_der_Gemeinden__Anzahl
---|---|---|---|---|---|---|---|---|---
11111 | Feststellung des Gebietsstandes | STAG | Stichtag | 31.12.2022 | KREISE | Kreise und kreisfreie Städte | DG | Deutschland | 10786
11111 | Feststellung des Gebietsstandes | STAG | Stichtag | 31.12.2022 | KREISE | Kreise und kreisfreie Städte | 01 | Schleswig-Holstein | 1106
The same table has the following pandas representation after being “prettified” by pystatis
:
[ ]:
t = Table("11111-02-01-4")
t.get_data()
t.data.head(2)
As you can see and hopefully agree, the pandas version (what we call "prettified") provides the same information, actually even more, because the column names have become meaningful and there is a lot less noise to filter out before you can get to the actual data.
For Zensus, pystatis does basically the same, but in a slightly different way: since the release of Zensus 2022, the API no longer returns each measurement as a separate column but puts all values into a single column. pystatis transforms this long data format back into a wide data format, so you can work with a tidy data set. See the following example of table "4000W-1002" to understand what is going on.
statistics_code | statistics_label | time_code | time_label | time | 1_variable_code | 1_variable_label | 1_variable_attribute_code | 1_variable_attribute_label | 2_variable_code | 2_variable_label | 2_variable_attribute_code | 2_variable_attribute_label | value | value_unit | value_variable_code | value_variable_label
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
4000W | Wohnungen (Gebietsstand 15.05.2022) | STAG | Stichtag | 2022-05-15 | GEODL1 | Deutschland | DG | Deutschland | WHGFL2 | Fläche der Wohnung (10 m²-Intervalle) | WFL170B179 | 170 - 179 m² | 1,2 | % | WHG002 | Wohnungen in Gebäuden mit Wohnraum
4000W | Wohnungen (Gebietsstand 15.05.2022) | STAG | Stichtag | 2022-05-15 | GEODL1 | Deutschland | DG | Deutschland | WHGFL2 | Fläche der Wohnung (10 m²-Intervalle) | WFL170B179 | 170 - 179 m² | 509041 | Anzahl | WHG002 | Wohnungen in Gebäuden mit Wohnraum
4000W | Wohnungen (Gebietsstand 15.05.2022) | STAG | Stichtag | 2022-05-15 | GEODL1 | Deutschland | DG | Deutschland | WHGFL2 | Fläche der Wohnung (10 m²-Intervalle) | WFL090B099 | 90 - 99 m² | 7,2 | % | WHG002 | Wohnungen in Gebäuden mit Wohnraum
4000W | Wohnungen (Gebietsstand 15.05.2022) | STAG | Stichtag | 2022-05-15 | GEODL1 | Deutschland | DG | Deutschland | WHGFL2 | Fläche der Wohnung (10 m²-Intervalle) | WFL090B099 | 90 - 99 m² | 3082890 | Anzahl | WHG002 | Wohnungen in Gebäuden mit Wohnraum
[ ]:
t = Table("4000W-1002")
t.get_data()
t.data.head(2)
As you can see, pystatis not only increases readability and makes data access easy, it also reduces the amount of data you have to work with. Going from the long format back to a tidy wide format cuts the number of rows to one third, because all three measurements get their own column again.
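The long-to-wide step can be sketched with plain pandas (a simplified illustration with made-up column names, not the actual pystatis internals):

```python
import pandas as pd

# Simplified sketch of the long-to-wide transformation applied to Zensus
# tables: each unit in the long "value" column becomes its own column in
# the wide format. Column names here are illustrative, not pystatis's.
long_df = pd.DataFrame(
    {
        "time": ["2022-05-15"] * 4,
        "region": ["DG"] * 4,
        "flat_size": ["170 - 179 m²", "170 - 179 m²", "90 - 99 m²", "90 - 99 m²"],
        "value_unit": ["%", "Anzahl", "%", "Anzahl"],
        "value": [1.2, 509041, 7.2, 3082890],
    }
)

# Pivot the unit column into the header: four long rows become two wide rows.
wide_df = long_df.pivot(
    index=["time", "region", "flat_size"], columns="value_unit", values="value"
).reset_index()
print(wide_df)
```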
pystatis does the following things (by default) when parsing the original raw string:
- remove the information about the statistic
- for all variables: only keep the value column and choose the variable label as the column name
- for all measurements: remove the variable code from the column name, only keep label and unit
- set the proper data types (datetime for the time variable, if appropriate; str for regional codes)
- handle missing values (i.e. replace the characters "…", ".", "-", "/" and "x" with proper NaN values) and special characters
- choose the right decimal character depending on the specified language (German: ",", English: ".")
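The missing-value and decimal handling can be sketched like this (assumed behavior for illustration, not the actual pystatis implementation):

```python
import numpy as np
import pandas as pd

# Sketch of the default cleaning (illustrative, not pystatis internals):
# replace the official missing-value symbols with NaN and parse German
# decimal commas into proper floats.
MISSING_SYMBOLS = ["...", ".", "-", "/", "x"]

raw = pd.Series(["1,2", "...", "7,2", "x", "509041"])

# Exact-match replacement of the missing-value symbols with NaN.
cleaned = raw.replace(MISSING_SYMBOLS, np.nan)

# Swap the German decimal comma for a dot, then convert to numbers.
numeric = pd.to_numeric(cleaned.str.replace(",", ".", regex=False))
print(numeric)
```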
All of this happens behind the scenes when you download the data with get_data() and access it via the Table.data attribute.
All get_data() parameters explained
You can find a list of all parameters in the documentation or in the docstring. All parameters are keyword-only (fancy Python star syntax: f(*, ...), everything after the * has to be passed as a keyword argument).
[ ]:
?t.get_data
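The star syntax can be illustrated with a toy function (this is not the real get_data() signature):

```python
# Toy sketch of keyword-only parameters (not the real get_data()
# signature): everything after the bare * must be passed by name.
def get_data_sketch(*, startyear=None, endyear=None, timeslices=None):
    return startyear, endyear, timeslices

print(get_data_sketch(startyear="2012"))  # ('2012', None, None)
# get_data_sketch("2012")  # raises TypeError: positional arguments not allowed
```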
prettify
prettify is a boolean and can only be True or False. The default is True because prettify is basically doing all the above-mentioned work behind the scenes to transform the raw data into the nicer, tidy version. However, as we don't know your specific requirements, it can always be the case that we are not doing what you want, or that we are doing it in the wrong way. Instead of starting from scratch with the raw string, prettify=False will still give you a pandas DataFrame, but without the transformations described in the previous sections. Basically, prettify=False gives you the raw data as a pandas DataFrame instead of a string, without any transformation on our side.
[ ]:
t = Table("1000A-0000")
t.get_data(prettify=False)
t.data.head(3)
[ ]:
# don't be confused by the query, we have to query by ARS in this example because prettify=True sorts the data by ARS and the order is different from above
t = Table("1000A-0000")
t.get_data(prettify=True)
t.data[
t.data["Amtlicher Regionalschlüssel (ARS)"].isin(
["092760130130", "073355011022", "130765654053"]
)
]
[ ]:
t.data.info()
area
We don’t have a good explanation for this one, so if you have a concrete use case, please let us know!
Here is the description from the official documentation:
The area query parameter specifies the area in which the object is stored, which is analogous to online navigation. Here is the breakdown:
For internal users:
- Meine/Benutzer
- Gruppe
- Amt
- Katalog/Öffentlich
- Alle
For external users:
- Meine/Benutzer
- Katalog/Öffentlich
This parameter corresponds to:
- Bereich=Benutzer as Bereich=Meine
- Bereich=Öffentlich as Bereich=Katalog
startyear, endyear and timeslices
All three parameters can be used to fetch data for a certain time range for the given table. The default is table-specific and has to be checked for each table; often it is just the latest available period.
The important thing here is that timeslices is cumulative to the other two options, meaning that timeslices=N will give you N more years after startyear or before endyear.
Let’s say you are interested in school-leaving qualifications over the years in Germany. Then Table 21111-0004 might be of interest to you. The description of the table mentions that data is available for the years 1997/98 - 2021/22. But what will the API return if you specify no time parameter?
[ ]:
t = Table("21111-0004")
t.get_data()
t.data["Schuljahr"].unique()
As you can see, pystatis only returns, for whatever reason, the years 2020/21 and 2021/22. How can you get the ten latest years? Let's see:
[ ]:
t.get_data(timeslices=10)
t.data["Schuljahr"].unique()
[ ]:
t.get_data(startyear="2012")
t.data["Schuljahr"].unique()
If you are only interested in a time period somewhere in between, you need to use both startyear and endyear:
[ ]:
t.get_data(startyear="2012", endyear="2015")
t.data["Schuljahr"].unique()
You might expect that using startyear and timeslices would give the same result, but it turns out that this is not the case, which is quite misleading. In fact, timeslices always comes on top of whatever you have selected with startyear and endyear. Is that confusing? We definitely think so!
[ ]:
t.get_data(
startyear="2012", endyear="2015", timeslices=3
) # gives everything between 2012 and 2015 plus three more years
t.data["Schuljahr"].unique()
[ ]:
t.get_data(
endyear="2015", timeslices=3
) # gives everything up to 2015 and three more years
t.data["Schuljahr"].unique()
regionalvariable and regionalkey
Tables whose name ends with a "B" in Regionalstatistik are special: they allow you to change the regional depth of the data, meaning that you can fetch data for different regional levels depending on these two parameters. The same is true for all Zensus tables.
To select a specific regional level, you can either specify regionalvariable and pass one of the reserved codes for this geo variable, or you can directly select a specific region via its key. Let's see some examples by analyzing table 12613-01-01-5-B:
[ ]:
t = Table("12613-01-01-5-B")
t.get_data()
t.data.head(5)
Instead of fetching the data for all municipalities, we can choose a different regional depth (see the codes here), for example "KREISE", one level above "GEMEINDE", which is the default for this table.
[ ]:
t = Table("12613-01-01-5-B")
t.get_data(regionalvariable="KREISE")
t.data.head(5)
regionalkey can be used to fetch only certain areas; see https://datengui.de/statistik-erklaert/ags. We now fetch only municipalities in Baden-Württemberg:
[ ]:
t = Table("12613-01-01-5-B")
t.get_data(regionalkey="08*")
t.data.head(5)
stand
stand can be used to download a table only if it has a version newer than the given date.
[ ]:
t = Table("21111-0004")
t.get_data()
t.data.head(5)
[ ]:
t.metadata["Object"]["Updated"]
[ ]:
t.get_data(stand="01.01.2023") # before updated date, so should return data
t.data.head()
[ ]:
t.get_data(stand="01.12.2024") # after updated date, so error
t.data.head()
language
language can either be "de" or "en", with "de" being the default, obviously. Regionalstatistik does not support "en" and will not translate any data; Genesis and Zensus have some support for English, but you have to check for yourself whether the data is translated and to what extent.
[ ]:
t = Table("81000-0001")
t.get_data()
t.data.head(1)
[ ]:
t = Table("81000-0001")
t.get_data(language="en")
t.data.head(1)
quality
quality can be either "on" or "off", with "off" being the default. When switched to "on", the downloaded table has an additional quality column "__q" for each value column, containing quality symbols. Check the "Explanation of symbols" documentation. This is not supported for all tables or databases.
[ ]:
t = Table("52111-0001")
t.get_data(quality="on")
t.data.head(1)
[ ]:
t = Table("12211-Z-11")
t.get_data(quality="on") # not supported, ignored, but also no warning
t.data.head(1)
[ ]:
t = Table("1000A-0000")
t.get_data(quality="on")
t.data.head(1)