this post was submitted on 25 Jun 2024
Python
Here “database” seems to mean a pandas DataFrame. It sounds like you need to create a real database using Postgres, SQLite, or something similar, and recreate it from a backup or database dump whenever you need it. You could also host that database in the cloud or on your own network if you need remote access.
For instance, see this pandas doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
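A minimal sketch of what that doc page describes, assuming an in-memory SQLite database and a made-up `sales` table (both are placeholders, not anything from the original question). It writes a DataFrame with `DataFrame.to_sql` and reads it back with `pandas.read_sql`:

```python
import sqlite3

import pandas as pd

# Hypothetical data standing in for the combined DataFrame.
df = pd.DataFrame({"region": ["north", "south"], "sales": [120, 95]})

# In-memory database for the sketch; passing a file path instead
# would persist the data between runs.
con = sqlite3.connect(":memory:")

# Write the DataFrame to a table, replacing it if it already exists.
df.to_sql("sales", con, if_exists="replace", index=False)

# Any later script can read it back with plain SQL.
loaded = pd.read_sql("SELECT region, sales FROM sales ORDER BY region", con)
con.close()
```

With a file-backed database, the expensive processing only has to run when the source data changes, and every other script just queries the table.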
Thanks, I solved it by creating a file with a function:
    def get_database(name):
        if name == 'database':
            # all the processing to create the database goes here
            ...
            return database
Then df = get_database('database') runs all the processing and returns the result.
I am a little curious about the conditional. I have a suspicion that this is a bit of over-engineering.
The problem you seem to be trying to solve is “I need to access the same data in multiple ways, places, or projects.” That’s what a database is really great for. However, if you just need to combine the same CSV files you have on disk over and over, why not combine them once and dump the output to a CSV? Next time you need it, just load the combined CSV. FWIW this is loosely what SQLite is doing.
If you are defining a method or function that performs these ETL operations over and over, and the underlying data is not changing, I think updating your local files to be the desired content and format is actually what you want.
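A rough sketch of the "combine once, load the result next time" idea, using only the standard library. It assumes every source CSV has the same columns and that at least one row exists; the file names and the `load_combined` helper are invented for illustration. The same pattern should carry over to `pandas.concat` plus `DataFrame.to_csv`.

```python
import csv
import tempfile
from pathlib import Path


def load_combined(source_dir: Path, combined_path: Path) -> list[dict]:
    """Return the combined rows, building the combined CSV only if missing."""
    if not combined_path.exists():
        rows = []
        for path in sorted(source_dir.glob("*.csv")):
            with path.open(newline="") as handle:
                rows.extend(csv.DictReader(handle))
        # Assumes all sources share the columns of the first row.
        with combined_path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    with combined_path.open(newline="") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp) / "raw"
    source_dir.mkdir()
    (source_dir / "jan.csv").write_text("id,amount\n1,10\n")
    (source_dir / "feb.csv").write_text("id,amount\n2,20\n")
    # Keep the combined file outside the source directory.
    combined_path = Path(tmp) / "combined.csv"

    first = load_combined(source_dir, combined_path)   # builds the file
    second = load_combined(source_dir, combined_path)  # reuses it
```

If the underlying files do change, deleting the combined file forces a rebuild on the next call.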
If instead you’re trying to explore modules, imports, abstraction, writing DRY code, or other software development fundamentals, great! Play around, it’s a great way to learn. I might also recommend picking up some books! Usually your local library has some books on Python development, object-oriented programming, and data engineering basics that you might find fascinating (as I have).
My local library gives me access to O'Reilly Online, so free textbook access for just about any topic.
Some of the data comes as CSV; the rest is in database files, on the SQL server, in Excel, or behind web APIs. For some of it I even need to combine multiple sources with different formats.
I guess I could have a database with everything tidier, easier to use, more secure, and with a lower failure rate. I'm still going to prepare the databases (I'm thinking of DataFrame objects in a pickle, but I want to experiment with Parquet) so they don't have to be processed every time, but I wanted something where I could just write the name of the database and get the updated version.
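A minimal sketch of "ask by name, get the updated version", assuming each dataset has a single source file whose modification time tells you when it changed. `get_cached`, `build_totals`, and the file names are hypothetical, and a plain dict stands in for a DataFrame; pickle is used here only because it is in the standard library.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def get_cached(name: str, source: Path, build: Callable[[Path], Any],
               cache_dir: Path) -> Any:
    """Return the processed object for `name`, rebuilding if `source` is newer."""
    cache_file = cache_dir / f"{name}.pkl"
    is_fresh = (cache_file.exists()
                and cache_file.stat().st_mtime >= source.stat().st_mtime)
    if is_fresh:
        with cache_file.open("rb") as handle:
            return pickle.load(handle)
    result = build(source)
    with cache_file.open("wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []  # records how often the expensive build actually runs


def build_totals(path: Path) -> dict:
    calls.append(path.name)
    values = [int(line) for line in path.read_text().split()]
    return {"total": sum(values)}


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    source = tmp_path / "raw.txt"
    source.write_text("1\n2\n3\n")
    first = get_cached("totals", source, build_totals, tmp_path)
    second = get_cached("totals", source, build_totals, tmp_path)  # cache hit
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.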
This sounds kind of like a data warehouse. Depending on the size of the data and the number of connections, I’d say this is a much bigger problem than choosing between a script, a database, or a module. Look into dbt (data build tool) and Airflow.
I have a data warehouse, and some of the databases I use come from there, but it can only be accessed from the virtual machine.
I would say consider having a script that combines all these sources into a single data mart for your monthly reports. It could also be useful for the ad hoc studies, but idk how many of the same fields you're using for those.
What are you trying to output in the end (dashboard? Report? Table?), how often are these inputs coming in, and how often do you run your process?
There are some reports that need to be run monthly. They have to be edited each month to add the directories with the new databases, and that causes problems, some of which I'm trying to solve with this. There are also a lot of ad hoc statistical studies I need to do that use the same databases.
It does sound to me like ingesting all these different formats into a normalized database (aka data warehousing) and then building your tools to report from that centralized warehouse is the way to go. Your warehouse could also track ingestion dates, original format converted from, etc. and then your tools only need to know that one source of truth.
Is there any reason not to build this as a two-step process of 1) ingestion to a central database and 2) reporting from said database?
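The two-step process could look roughly like this. It is a sketch only, assuming SQLite as the central database and a made-up `measurements` table with `month` and `amount` columns; the source names are invented. Each ingested row carries its source, original format, and ingestion timestamp, as suggested above.

```python
import sqlite3
from datetime import datetime, timezone


def ingest(con, source_name, source_format, rows):
    """Step 1: load rows from one source into the central table with metadata."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    con.executemany(
        "INSERT INTO measurements "
        "(source_name, source_format, ingested_at, month, amount) "
        "VALUES (?, ?, ?, ?, ?)",
        [(source_name, source_format, ingested_at, month, amount)
         for month, amount in rows],
    )
    con.commit()


def monthly_report(con):
    """Step 2: report from the central table only, never from raw sources."""
    cursor = con.execute(
        "SELECT month, SUM(amount) FROM measurements "
        "GROUP BY month ORDER BY month"
    )
    return cursor.fetchall()


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE measurements ("
    "source_name TEXT, source_format TEXT, ingested_at TEXT, "
    "month TEXT, amount REAL)"
)
# Hypothetical sources; real ingestion would parse the CSV, Excel, or API data.
ingest(con, "sales_export", "csv", [("2024-05", 10.0), ("2024-06", 5.0)])
ingest(con, "billing_api", "web_api", [("2024-06", 2.5)])
report = monthly_report(con)
con.close()
```

The monthly report then never needs editing when a new source directory appears; only the ingestion step changes.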