{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "165da3a1", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import logging\n", "\n", "from pystatis import Table\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "markdown", "id": "8e14f4db", "metadata": {}, "source": [ "# The `Table` Class\n", "\n", "The `Table` class in `pystatis` is the main interface for users to interact with the different databases and download tables as `pandas` `DataFrame`s.\n" ] }, { "cell_type": "markdown", "id": "07f3dee4", "metadata": {}, "source": [ "To use the class, you only have to pass a single parameter: the `name` of the table you want to download.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5d25c79a", "metadata": {}, "outputs": [], "source": [ "t = Table(name=\"81000-0001\")" ] }, { "cell_type": "markdown", "id": "8ca8127a", "metadata": {}, "source": [ "## Downloading data\n" ] }, { "cell_type": "markdown", "id": "3d841f94", "metadata": {}, "source": [ "Creating a new `Table` instance does not automatically retrieve the data from the database (or cache). Instead, you have to call another method: `get_data()`. 
This design gives you full control over the download process and avoids unnecessary downloads of big tables until you are certain you want to start the download.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "632fc783", "metadata": { "scrolled": false }, "outputs": [], "source": [ "t.get_data()" ] }, { "cell_type": "markdown", "id": "2370bd5e", "metadata": {}, "source": [ "You can access the name of a table via the `.name` attribute.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "f5f1aded", "metadata": {}, "outputs": [], "source": [ "t.name" ] }, { "cell_type": "markdown", "id": "4e050eed", "metadata": {}, "source": [ "After a successful download (or cache retrieval), you can always access the raw data, that is, the original response from the web API as a string, via the `.raw_data` attribute.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8fede338", "metadata": {}, "outputs": [], "source": [ "print(t.raw_data)" ] }, { "cell_type": "markdown", "id": "5ae90416", "metadata": {}, "source": [ "More likely, you are interested in the `pandas` `DataFrame`, which is accessible via the `.data` attribute.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "874bbbb9", "metadata": {}, "outputs": [], "source": [ "t.data.head()" ] }, { "cell_type": "markdown", "id": "677d68b7", "metadata": {}, "source": [ "Finally, you can also access the metadata for this table via the `.metadata` attribute.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5f3672e9", "metadata": {}, "outputs": [], "source": [ "from pprint import pprint\n", "\n", "pprint(t.metadata)" ] }, { "cell_type": "markdown", "id": "953d7cb2", "metadata": {}, "source": [ "## How `pystatis` prepares the data for you\n" ] }, { "cell_type": "markdown", "id": "22b075b6", "metadata": {}, "source": [ "As you can see by comparing the `.raw_data` and `.data` formats, `pystatis` is doing a lot behind the scenes to provide 
you with a format that is hopefully the most useful for you. You will see that there are a few parameters you can use to change this behavior and adjust the table to your needs.\n", "\n", "But first, we would like to explain how `pystatis` prepares the data by default, so you have a better understanding of the underlying process.\n" ] }, { "cell_type": "markdown", "id": "889e27db", "metadata": {}, "source": [ "When we look at the header of the raw data, we notice a few things:\n", "\n", "- Many columns come in pairs of `*_Code` and `*_Label` columns. Both contain the same information, only provided differently.\n", "- Some columns have no direct use because they contain information not needed in the table, like the `Statistik_Code` and `Statistik_Label` columns at the beginning. You already know the statistic from the name of the table, and this information is the same for every row anyway.\n", "- There is always a time dimension, broken down into three different columns: `Zeit_Code`, `Zeit_Label` and `Zeit` (or `time_*` in English).\n", "- The other dimensions are called variables (German \"Merkmale\") and they always come in groups of four columns: `N_Merkmal_Code`, `N_Merkmal_Label`, `N_Auspraegung_Code`, and `N_Auspraegung_Label` (English: variable code and label, and variable value code and label).\n", "- The actual measurements or values are at the end of the table after the variables, and each measurement has one column. The name of this column follows the format `__