[ ]:
import logging

from pystatis import Table

logging.basicConfig(level=logging.INFO)

The Table Class

The Table class in pystatis is the main interface for users to interact with the different databases and download the data/tables in the form of pandas DataFrames.

To use the class, you have to pass only a single parameter: the name of the table you want to download.

[ ]:
t = Table(name="81000-0001")

Downloading data

However, creating a new Table instance does not automatically retrieve the data from the database (or cache). Instead, you have to call another method: get_data(). The reason for this decision is to give you full control over the download process and to avoid unnecessarily downloading big tables unless you are certain you want to start the download.

[ ]:
t.get_data()

You can access the name of a table via the .name attribute.

[ ]:
t.name

After a successful download (or cache retrieval), you can always access the raw data, that is the original response from the web API as a string, via the .raw_data attribute.

[ ]:
print(t.raw_data)
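
Since the raw string can be fairly large, it is sometimes enough to look at only the first few lines:

[ ]:
# print only the first few lines of the raw response instead of the whole string
print("\n".join(t.raw_data.splitlines()[:5]))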

More likely, you are interested in the pandas DataFrame, which is accessible via the .data attribute.

[ ]:
t.data.head()

Finally, you can also access the metadata for this table via the .metadata attribute.

[ ]:
from pprint import pprint

pprint(t.metadata)
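
Since the metadata is a nested dictionary, you can also pick out individual fields, for example the date of the last update (the same field is used again further down in this notebook):

[ ]:
# access a single metadata field; "Object" -> "Updated" holds the last update date
t.metadata["Object"]["Updated"]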

How pystatis prepares the data for you

As you can see from a comparison between the .raw_data and .data formats, pystatis does a lot behind the scenes to provide you with a format that is hopefully the most useful for you. You will see later that there are a few parameters you can use to change this behavior and adjust the table to your needs.

But first we would like to explain how pystatis prepares the data by default, so that you have a better understanding of the underlying process.

When we look at the header of the raw data, we can notice a few things:

  • Many columns always come in a pair of *_Code and *_Label columns. Both contain the same information, only provided differently.

  • There are columns that don’t have a direct use as they contain information not needed in the table, like the Statistik_Code and Statistik_Label columns at the beginning. You already know the statistic from the name of the table and this information is the same for each and every row anyway.

  • There is always a time dimension, broken down into three different columns Zeit_Code, Zeit_Label and Zeit (or time_* in English).

  • The other dimensions are called variables (German “Merkmale”) and they always come in groups of four columns: N_Merkmal_Code, N_Merkmal_Label, N_Auspraegung_Code, and N_Auspraegung_Label (English: variable code and label and variable value code and label).

  • The actual measurements or values are at the end of the table after the variables, and each measurement has its own column. The name of this column follows the format <CODE>__<LABEL>__<UNIT>, e.g. “BWS001__Bruttowertschoepfung__jew._ME”. “BWS001” is the unique code for this variable, “Bruttowertschoepfung” is the human-readable label of the variable, and “jew._ME” is the unit the measurement was recorded in.

Note: This is only true for tables from Genesis and Regionalstatistik; the format of the Zensus tables is noticeably different. However, we follow a similar approach to provide you with the same convenient output format.
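
If you want to inspect the column structure of the raw data yourself, you can parse the raw string directly. The following is a small sketch that assumes the raw response is semicolon-separated CSV text (which is, as far as we know, what Genesis and Regionalstatistik deliver):

[ ]:
import io

import pandas as pd

# parse the raw string ourselves just to look at the column names
# assumption: the raw response is a semicolon-separated CSV
raw_df = pd.read_csv(io.StringIO(t.raw_data), sep=";", dtype=str)
print(raw_df.columns.tolist())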

The following table hopefully makes it a little bit clearer what is happening when going from the raw data string to the pandas DataFrame. The example shows Table “11111-02-01-4” from Regionalstatistik, but remember that Genesis and Regionalstatistik have identical formats. The table has a time dimension, one variable, and one value.

| Statistik_Code | Statistik_Label | Zeit_Code | Zeit_Label | Zeit | 1_Merkmal_Code | 1_Merkmal_Label | 1_Auspraegung_Code | 1_Auspraegung_Label | GEM001__Zahl_der_Gemeinden__Anzahl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 11111 | Feststellung des Gebietsstandes | STAG | Stichtag | 31.12.2022 | KREISE | Kreise und kreisfreie Städte | DG | Deutschland | 10786 |
| 11111 | Feststellung des Gebietsstandes | STAG | Stichtag | 31.12.2022 | KREISE | Kreise und kreisfreie Städte | 01 | Schleswig-Holstein | 1106 |

The same table has the following pandas representation after being “prettified” by pystatis:

[ ]:
t = Table("11111-02-01-4")
t.get_data()
t.data.head(2)

As you can see, and hopefully agree, the pandas version (what we call “prettified”) provides the same information, and actually even more, because the column names have become meaningful and there is a lot less noise to filter out before you get to the actual data.
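
If you want to see all prettified column names at a glance, you can simply list them:

[ ]:
# list the column names of the prettified DataFrame
t.data.columns.tolist()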

For Zensus, pystatis basically does the same, but in a slightly different way: since the release of Zensus 2022, the API no longer returns each measurement as a separate column but puts all values into a single column. pystatis transforms this long data format back into a wide data format so you can work with a tidy data set. See the following example of Table “4000W-1002” to understand what is going on.

| statistics_code | statistics_label | time_code | time_label | time | 1_variable_code | 1_variable_label | 1_variable_attribute_code | 1_variable_attribute_label | 2_variable_code | 2_variable_label | 2_variable_attribute_code | 2_variable_attribute_label | value | value_unit | value_variable_code | value_variable_label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4000W | Wohnungen (Gebietsstand 15.05.2022) | STAG | Stichtag | 2022-05-15 | GEODL1 | Deutschland | DG | Deutschland | WHGFL2 | Fläche der Wohnung (10 m²-Intervalle) | WFL170B179 | 170 - 179 m² | 1,2 | % | WHG002 | Wohnungen in Gebäuden mit Wohnraum |
| 4000W | Wohnungen (Gebietsstand 15.05.2022) | STAG | Stichtag | 2022-05-15 | GEODL1 | Deutschland | DG | Deutschland | WHGFL2 | Fläche der Wohnung (10 m²-Intervalle) | WFL170B179 | 170 - 179 m² | 509041 | Anzahl | WHG002 | Wohnungen in Gebäuden mit Wohnraum |
| 4000W | Wohnungen (Gebietsstand 15.05.2022) | STAG | Stichtag | 2022-05-15 | GEODL1 | Deutschland | DG | Deutschland | WHGFL2 | Fläche der Wohnung (10 m²-Intervalle) | WFL090B099 | 90 - 99 m² | 7,2 | % | WHG002 | Wohnungen in Gebäuden mit Wohnraum |
| 4000W | Wohnungen (Gebietsstand 15.05.2022) | STAG | Stichtag | 2022-05-15 | GEODL1 | Deutschland | DG | Deutschland | WHGFL2 | Fläche der Wohnung (10 m²-Intervalle) | WFL090B099 | 90 - 99 m² | 3082890 | Anzahl | WHG002 | Wohnungen in Gebäuden mit Wohnraum |

[ ]:
t = Table("4000W-1002")
t.get_data()
t.data.head(2)

As you can see, pystatis is not only increasing readability and making data access easier, it also reduces the amount of data you have to work with. Going from the long format back to a tidy wide format cuts the number of rows to one third because all three measurements get their own column again.
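
To illustrate the idea behind this long-to-wide transformation (this is a simplified sketch, not the actual pystatis implementation), here is how it can look in plain pandas, using the numbers from the example above:

[ ]:
import pandas as pd

# a minimal long-format example with one row per (attribute, unit) combination
long_df = pd.DataFrame(
    {
        "Fläche der Wohnung": ["170 - 179 m²", "170 - 179 m²", "90 - 99 m²", "90 - 99 m²"],
        "unit": ["%", "Anzahl", "%", "Anzahl"],
        "value": [1.2, 509041, 7.2, 3082890],
    }
)

# pivot so that every unit/measurement gets its own column -> tidy wide format
wide_df = long_df.pivot(index="Fläche der Wohnung", columns="unit", values="value")
wide_df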

pystatis does the following things (by default) when parsing the original raw string:

  • remove the information about the statistic

  • for all variables: only keep the value column and choose the variable label as the column name

  • for all measurements: remove the variable code from the column name and only keep label and unit

  • set the proper data types (datetime for the time variable, if appropriate; str for regional codes)

  • handle missing values (i.e. replace the characters “…”, “.”, “-”, “/” and “x” with proper NaN values) and special characters

  • choose the right decimal character depending on the specified language (German: “,”, English: “.”)

All of this happens behind the scenes when you are downloading the data with get_data() and access it via the Table.data attribute.
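
To give you a feeling for the missing value and decimal handling, here is a rough sketch in plain pandas of what such a cleanup can look like (an illustration only, not the actual pystatis code):

[ ]:
import numpy as np
import pandas as pd

# illustration only: turn special characters into NaN and parse German decimal numbers
s = pd.Series(["1,2", "…", "7,2", "-", "x"])
cleaned = (
    s.replace(["…", ".", "-", "/", "x"], np.nan)  # special markers become proper NaN
    .str.replace(",", ".", regex=False)  # German decimal comma -> decimal point
    .astype(float)  # parse the remaining strings as numbers
)
cleaned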

All get_data() parameters explained

You can find a list of all parameters in the documentation or in the docstring. All parameters are keyword-only (the Python star syntax f(*, ...): everything after the bare * has to be passed as a keyword argument).

[ ]:
?t.get_data
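
If you are not familiar with keyword-only parameters, here is a minimal, generic illustration of the star syntax (plain Python, not pystatis code):

[ ]:
# everything after the bare * has to be passed as a keyword argument
def example(*, startyear=None, endyear=None):
    return startyear, endyear


example(startyear="2012")  # works
# example("2012")  # would raise a TypeError, positional arguments are not allowed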

prettify

prettify is a boolean and can only be True or False. The default is True because prettify does all the above-mentioned work behind the scenes to transform the raw data into the nicer, tidy version. However, as we don’t know your specific requirements, it can always happen that we are not doing what you want or are doing it in the wrong way. Instead of forcing you to start from scratch with the raw string, prettify=False will still give you a pandas DataFrame, just without the transformations described in the previous sections. Basically, prettify=False gives you the raw data as a pandas DataFrame instead of a string, without any transformation from our side.

[ ]:
t = Table("1000A-0000")
t.get_data(prettify=False)
t.data.head(3)
[ ]:
# don't be confused by the filter: we select rows by ARS in this example
# because prettify=True sorts the data by ARS, so the row order differs from above
t = Table("1000A-0000")
t.get_data(prettify=True)
t.data[
    t.data["Amtlicher Regionalschlüssel (ARS)"].isin(
        ["092760130130", "073355011022", "130765654053"]
    )
]
[ ]:
t.data.info()

area

We don’t have a good explanation for this one, so if you have a concrete use case, please let us know!

Here is the description from the official documentation:

The area query parameter specifies the area in which the object is stored, which is analogous to online navigation. Here is the breakdown:

For internal users:

  • Meine/Benutzer

  • Gruppe

  • Amt

  • Katalog/Öffentlich

  • Alle

For external users:

  • Meine/Benutzer

  • Katalog/Öffentlich

This parameter corresponds to:

  • Bereich=Benutzer as Bereich=Meine

  • Bereich=Öffentlich as Bereich=Katalog
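
If you still want to experiment with it, area is passed like any other keyword argument. We assume here that "all" is the broadest setting and, to our knowledge, also the default; treat this as an assumption rather than a recommendation:

[ ]:
t = Table("81000-0001")
# assumption: "all" corresponds to the widest area ("Alle")
t.get_data(area="all")
t.data.head(1)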

startyear, endyear and timeslices

All three parameters can be used to fetch data for a certain time range for the given Table. The default is Table specific and has to be checked for each Table; often it is just the latest available period.

The important thing here is that timeslices is cumulative on top of the other two options: timeslices=N will give you N additional years on top of the range selected with startyear and/or endyear.

Let’s say you are interested in school-leaving qualifications over the years in Germany. Then Table 21111-0004 might be of interest to you. The description of the table mentions that data is available for the years 1997/98 - 2021/22. But what will the API return if you specify no time parameter?

[ ]:
t = Table("21111-0004")
t.get_data()
t.data["Schuljahr"].unique()

As you can see, for whatever reason, you only get the years 2020/21 and 2021/22. How can you get the ten latest years? Let’s see:

[ ]:
t.get_data(timeslices=10)
t.data["Schuljahr"].unique()
[ ]:
t.get_data(startyear="2012")
t.data["Schuljahr"].unique()

If you are only interested in a time period somewhere in between, you need to use both startyear and endyear:

[ ]:
t.get_data(startyear="2012", endyear="2015")
t.data["Schuljahr"].unique()

You might expect that using startyear and timeslices would give the same result, but it turns out that this is not the case, which is quite misleading. In fact, timeslices always comes on top of whatever you have selected with startyear and endyear. Is that confusing? We definitely think so!

[ ]:
t.get_data(
    startyear="2012", endyear="2015", timeslices=3
)  # gives everything between 2012 and 2015 plus three more years
t.data["Schuljahr"].unique()
[ ]:
t.get_data(
    endyear="2015", timeslices=3
)  # gives everything up to 2015 plus three more years
t.data["Schuljahr"].unique()

regionalvariable and regionalkey

Tables in Regionalstatistik that end with a “B” are special: they allow you to change the regional depth of the data, meaning that you can fetch data for different regional areas depending on these two parameters. The same is true for all Zensus tables.

To select a specific regional level, you can either specify regionalvariable and pass one of the reserved codes for this geo variable, or you can directly select specific regions via their keys. Let’s look at some examples and analyze Table 12613-01-01-5-B:

[ ]:
t = Table("12613-01-01-5-B")
t.get_data()
t.data.head(5)

Instead of fetching the data for all municipalities, we can choose a different regional depth (see the codes here), for example “KREISE”, one level above “GEMEINDE”, which is the default for this table.

[ ]:
t = Table("12613-01-01-5-B")
t.get_data(regionalvariable="KREISE")
t.data.head(5)

regionalkey can be used to fetch only certain areas, see https://datengui.de/statistik-erklaert/ags. We now fetch only municipalities in Baden-Württemberg:

[ ]:
t = Table("12613-01-01-5-B")
t.get_data(regionalkey="08*")
t.data.head(5)

stand

stand can be used to only download a table if its version is newer than the given date.

[ ]:
t = Table("21111-0004")
t.get_data()
t.data.head(5)
[ ]:
t.metadata["Object"]["Updated"]
[ ]:
t.get_data(stand="01.01.2023")  # before updated date, so should return data
t.data.head()
[ ]:
t.get_data(stand="01.12.2024")  # after updated date, so error
t.data.head()

language

language can either be “de” or “en”, with “de” being the default, obviously. Regionalstatistik does not support “en” and will not translate any data; Genesis and Zensus have some support for English, but you have to check for yourself whether the data is translated and to what extent.

[ ]:
t = Table("81000-0001")
t.get_data()
t.data.head(1)
[ ]:
t = Table("81000-0001")
t.get_data(language="en")
t.data.head(1)

quality

quality can be either “on” or “off”, with “off” being the default. When switching to “on”, the downloaded table has an additional quality column (suffix “__q”) for each value column, containing quality symbols. Check the “Explanation of symbols” of the respective database for their meaning. This is not supported for all tables or databases.

[ ]:
t = Table("52111-0001")
t.get_data(quality="on")
t.data.head(1)
[ ]:
t = Table("12211-Z-11")
t.get_data(quality="on")  # not supported, ignored, but also no warning
t.data.head(1)
[ ]:
t = Table("1000A-0000")
t.get_data(quality="on")
t.data.head(1)