feat: kaggle refactor #489

Merged
merged 24 commits into from
Nov 20, 2024

@you-n-g (Contributor) commented Nov 15, 2024

Task: new kaggle mechanism; template from scratch

small size data

[[rdagent/scenarios/kaggle/tpl_ex/aerial-cactus-identification/main.py:235]]
[[rdagent/scenarios/kaggle/tpl_ex/aerial-cactus-identification/load_data.py:55]]

deprecated:[[rdagent/scenarios/kaggle/tpl_ex/aerial-cactus-identification/train.py:18]]
Sample data code:

from pathlib import Path

import fire
import pandas as pd
from rdagent.app.kaggle.conf import KAGGLE_IMPLEMENT_SETTING


def create_debug_data(competition="new-york-city-taxi-fare-prediction", min_frac=0.05, min_num=100):
    """Replace train.csv with a small sample and keep the original as train.full.csv."""
    # Path to the competition's training CSV
    csv_path = Path(KAGGLE_IMPLEMENT_SETTING.Local_data_path) / competition / "train.csv"

    # Path where the untouched original is kept
    full_csv_path = csv_path.with_name("train.full.csv")

    # Only sample when the .full backup does not exist yet,
    # so that a rerun never samples an already-sampled file
    if not full_csv_path.exists():
        # Load the CSV file
        df = pd.read_csv(csv_path)

        # Fraction to sample: at least min_frac, and at least min_num rows;
        # capped at 1.0 because sampling without replacement cannot exceed the table size
        frac = min(1.0, max(min_frac, min_num / len(df)))

        # Sample the data with a fixed seed for reproducibility
        df_sampled = df.sample(frac=frac, random_state=1)

        # Save the sampled data to a new CSV file
        sampled_csv_path = csv_path.with_name("train_sampled.csv")
        df_sampled.to_csv(sampled_csv_path, index=False)

        # Rename the original file to train.full.csv
        csv_path.rename(full_csv_path)

        # Move the sampled data into the original file's place
        sampled_csv_path.rename(csv_path)


if __name__ == "__main__":
    fire.Fire(create_debug_data)
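
The sampling fraction in the script above is the larger of `min_frac` and `min_num / len(df)`. The arithmetic can be sketched in isolation as below. The helper name `debug_sample_frac` is hypothetical, and the cap at 1.0 is an assumption added here so that a table with fewer than `min_num` rows does not request more rows than it has.

```python
def debug_sample_frac(n_rows: int, min_frac: float = 0.05, min_num: int = 100) -> float:
    """Fraction of rows to keep when building a debug-sized copy of a dataset."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    # Keep at least min_frac of the rows, and at least min_num rows when available.
    frac = max(min_frac, min_num / n_rows)
    # Assumption (not in the original script): cap at 1.0, because sampling
    # without replacement cannot return more rows than the table contains.
    return min(1.0, frac)


if __name__ == "__main__":
    # Small, medium, and large tables hit the cap, the min_num rule,
    # and the min_frac rule respectively.
    for n in (50, 1_000, 1_000_000):
        print(n, debug_sample_frac(n))
```

With the defaults, a 1,000-row table is governed by `min_num` (10% of rows), while a 1,000,000-row table is governed by `min_frac` (5% of rows).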

Config

To run it successfully, we temporarily use the default Kaggle image by pulling it instead of building from a Dockerfile.
Here is the example config from xiao:

KG_LOCAL_DATA_PATH=
KG_IF_USING_MLE_DATA=True
 
KG_DOCKER_BUILD_FROM_DOCKERFILE=False
KG_DOCKER_IMAGE="gcr.io/kaggle-gpu-images/python:latest"
KG_DOCKER_DEFAULT_ENTRY="sh -c 'python main.py; sleep 200'"
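
For illustration, the `KG_`-prefixed variables above could be read from the environment roughly as follows. This is a stdlib-only sketch: the `KaggleDebugConfig` dataclass, its field names, the fallback defaults, and the `load_kg_config` helper are all hypothetical and are not RD-Agent's actual settings class.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class KaggleDebugConfig:
    """Hypothetical container for the KG_* variables; not RD-Agent's settings class."""

    local_data_path: str
    if_using_mle_data: bool
    docker_build_from_dockerfile: bool
    docker_image: str
    docker_default_entry: str


def _as_bool(value: str) -> bool:
    # Accept the usual truthy spellings found in .env files.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_kg_config(env: Mapping[str, str] = os.environ) -> KaggleDebugConfig:
    # The fallback values below are placeholders chosen for this sketch only.
    return KaggleDebugConfig(
        local_data_path=env.get("KG_LOCAL_DATA_PATH", ""),
        if_using_mle_data=_as_bool(env.get("KG_IF_USING_MLE_DATA", "False")),
        docker_build_from_dockerfile=_as_bool(env.get("KG_DOCKER_BUILD_FROM_DOCKERFILE", "False")),
        docker_image=env.get("KG_DOCKER_IMAGE", ""),
        docker_default_entry=env.get("KG_DOCKER_DEFAULT_ENTRY", ""),
    )


if __name__ == "__main__":
    example_env = {
        "KG_LOCAL_DATA_PATH": "",
        "KG_IF_USING_MLE_DATA": "True",
        "KG_DOCKER_BUILD_FROM_DOCKERFILE": "False",
        "KG_DOCKER_IMAGE": "gcr.io/kaggle-gpu-images/python:latest",
        "KG_DOCKER_DEFAULT_ENTRY": "sh -c 'python main.py; sleep 200'",
    }
    print(load_kg_config(example_env))
```

Passing a plain dict instead of `os.environ` keeps the parsing easy to check without touching the real environment.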

TODO:

  • Aligning the path with Kaggle's would be better; replace "kg_workspace".
  • Unzip the internal content in the package.

Description

Motivation and Context

How Has This Been Tested?

  • Pass the test by running: pytest qlib/tests/test_all_pipeline.py under upper directory of qlib.
  • If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

  1. Pipeline test:
  2. Your own tests:

Types of changes

  • Fix bugs
  • Add new feature
  • Update documentation

📚 Documentation preview 📚: https://RDAgent--489.org.readthedocs.build/en/489/

@you-n-g you-n-g marked this pull request as draft November 19, 2024 03:42
@XianBW XianBW marked this pull request as ready for review November 20, 2024 08:30
@XianBW XianBW merged commit 1b057d0 into main Nov 20, 2024
8 checks passed
@XianBW XianBW deleted the kaggle_refactor branch November 20, 2024 09:01