feat: ClickHouse import automation PE-7137 #241

Draft
wants to merge 4 commits into base: develop
Conversation

djwhitt (Collaborator) commented Nov 27, 2024

No description provided.

…E-7137

Adding min/max data item heights to debug endpoint output both for
general utility and to track what height ranges are available to export
as Parquet.
This index will be used to quickly determine what stable data items to
delete after Parquet exports and imports complete.
This will be used by the Parquet => ClickHouse export automation to clean up
data items that have already been exported.
…ckHouse import PE-7137

Adds a clickhouse-auto-import script that loops forever exporting
Parquet from the SQLite DB, importing it into ClickHouse, moving it to
another directory for potential archival, and pruning old data from
SQLite.

When using Docker Compose, the script is run automatically in a separate
service when the 'clickhouse' profile is in use.
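
For reference, the stack can be brought up with that profile enabled using standard Docker Compose flags (service name as added in docker-compose.yaml in this PR; an illustrative sketch, not part of the change itself):

# Start the gateway together with the ClickHouse services; the 'clickhouse'
# profile activates the clickhouse-auto-import service added in this PR.
docker compose --profile clickhouse up -d

# Follow the auto-import service logs to watch the export/import loop.
docker compose logs -f clickhouse-auto-import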

Currently the script is written in bash. In the future we may integrate
it into the core service TypeScript codebase, but having a standalone
script initially allows for faster iteration.
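
To make the flow concrete, here is a rough outline of such a loop. This is an illustrative sketch only, not the script from this PR; the directory variables are assumptions based on the defaults discussed in the review below.

#!/usr/bin/env bash
# Illustrative outline of an auto-import loop (not the actual
# scripts/clickhouse-auto-import from this PR).
set -euo pipefail

parquet_dir=${PARQUET_PATH:-data/parquet}   # where exports land (assumed default)
imported_dir="$parquet_dir/imported"        # archival directory (assumed)

while true; do
    # 1. Request a Parquet export of a stable height range via the core
    #    service admin API and wait for it to complete (see comments below).

    # 2. Load the exported Parquet files into ClickHouse.
    ./scripts/clickhouse-import

    # 3. Keep processed files around for potential archival.
    mv "$parquet_dir"/*.parquet "$imported_dir/"

    # 4. Prune the exported data items from SQLite via the admin API,
    #    then pause before the next iteration.
    sleep 10
done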
@djwhitt djwhitt marked this pull request as draft November 27, 2024 16:20

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 36.36364% with 14 lines in your changes missing coverage. Please review.

Project coverage is 70.08%. Comparing base (c40ed8e) to head (771083c).
Report is 9 commits behind head on develop.

Files with missing lines: src/database/standalone-sqlite.ts — patch 36.36%, 14 lines missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #241      +/-   ##
===========================================
- Coverage    70.17%   70.08%   -0.10%     
===========================================
  Files           33       33              
  Lines         8211     8233      +22     
  Branches       455      460       +5     
===========================================
+ Hits          5762     5770       +8     
- Misses        2447     2461      +14     
  Partials         2        2              

☔ View full report in Codecov by Sentry.

coderabbitai bot (Contributor) commented Nov 27, 2024

📝 Walkthrough


This pull request introduces a new Dockerfile for setting up a ClickHouse auto-import environment, along with modifications to the docker-compose.yaml to add a new service for this functionality. It also includes the creation of a new SQL index, updates to existing SQL commands, and the addition of scripts for data import automation. Furthermore, it enhances the database management capabilities by adding methods for pruning stable data items and introducing a new API endpoint for this operation.
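
As a usage illustration of that endpoint (host, port, and threshold value are placeholders; only the path and request body field come from this PR):

# Hypothetical invocation of the new pruning endpoint; adjust host, port,
# and ADMIN_API_KEY for your deployment.
curl -X POST "http://localhost:4000/ar-io/admin/prune-stable-data-items" \
    -H "Authorization: Bearer $ADMIN_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"indexedAtThreshold": 1732600000}'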

Changes

  • Dockerfile.clickhouse-auto-import: New Dockerfile for the ClickHouse auto-import setup; installs ClickHouse and copies the import scripts.
  • docker-compose.yaml: Adds the clickhouse-auto-import service, a new volume mapping on the core service, and network configuration changes for the clickhouse service.
  • migrations/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql: Creates the stable_data_items_indexed_at_idx index on the indexed_at column of stable_data_items.
  • migrations/down/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql: Drops the stable_data_items_indexed_at_idx index.
  • scripts/clickhouse-auto-import: New script automating Parquet export from the API and import into ClickHouse.
  • scripts/clickhouse-import: Adds a clickhouse_host variable for connecting to the specified ClickHouse host.
  • src/database/sql/bundles/cleanup.sql: New SQL command deleting records from stable_data_items where indexed_at is below a threshold.
  • src/database/sql/bundles/stats.sql: Extends the stats queries with minimum and maximum height metrics for data items.
  • src/database/standalone-sqlite.ts: Adds a pruneStableDataItems method for deleting stable data items by indexed timestamp.
  • src/routes/ar-io.ts: Adds the POST /ar-io/admin/prune-stable-data-items endpoint, with input validation.

Possibly related PRs

Suggested reviewers

  • karlprieb

Warning

There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure.

🔧 eslint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

error @permaweb/[email protected]: The engine "yarn" is incompatible with this module. Expected version "please-use-npm". Got "1.22.22"
error Found incompatible module.



coderabbitai bot (Contributor) left a comment


Actionable comments posted: 14

🧹 Outside diff range and nitpick comments (9)
src/database/sql/bundles/cleanup.sql (1)

16-17: Consider handling NULL values explicitly

The current WHERE clause might delete rows where indexed_at is NULL. If this is not intended, consider adding an explicit NULL check.

DELETE FROM stable_data_items
-WHERE indexed_at < @indexed_at_threshold
+WHERE indexed_at IS NOT NULL AND indexed_at < @indexed_at_threshold
Dockerfile.clickhouse-auto-import (2)

11-13: Consider documenting directory structure and setting explicit permissions

While the directory structure is created, its purpose and permissions should be more clearly defined.

Consider this enhancement:

 # Create necessary directories
 WORKDIR /app
-RUN mkdir -p data/parquet/imported
+# Create directories with appropriate permissions for imported data
+RUN mkdir -p data/parquet/imported && \
+    chown -R nobody:nogroup data && \
+    chmod -R 755 data

Also, consider adding a comment explaining the purpose of each directory in the structure.


1-26: Consider implementing health checks and logging strategy

While the basic functionality is in place, the Dockerfile could benefit from additional operational features:

  1. Add HEALTHCHECK instruction to monitor the import process
  2. Implement structured logging
  3. Consider adding a volume for persistent data
  4. Add container labels for metadata

Example health check implementation:

+# Add monitoring capabilities
+HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:8123/ping || exit 1
+
+# Add metadata
+LABEL maintainer="AR.IO Team" \
+      description="ClickHouse auto-import service" \
+      version="1.0"
scripts/clickhouse-import (2)

Line range hint 6-18: Fix potential security and reliability issues in file handling

Several issues need attention:

  1. Using ls for filename matching is unsafe
  2. Missing file existence validation
  3. SQL injection vulnerability through filenames
  4. No error handling for failed imports

Apply these improvements:

-for blocks_parquet in "$parquet_path/blocks"*.parquet; do
+# Use find instead of shell globbing for safer file handling
+while IFS= read -r blocks_parquet; do
+    # Validate file exists and is readable
+    if [[ ! -r "$blocks_parquet" ]]; then
+        echo "Error: Cannot read $blocks_parquet" >&2
+        continue
+    fi
+
     height_range=$(basename "$blocks_parquet" | sed 's/blocks-//;s/-rowCount:[0-9]*\.parquet//')
-    txs_parquet=$(ls data/parquet/transactions-"$height_range"-*.parquet)
-    tags_parquet=$(ls data/parquet/tags-"$height_range"-*.parquet)
+    # Use find for safer file matching
+    txs_parquet=$(find "$parquet_path" -type f -name "transactions-${height_range}-*.parquet" -print -quit)
+    tags_parquet=$(find "$parquet_path" -type f -name "tags-${height_range}-*.parquet" -print -quit)
+
+    # Validate all required files exist
+    if [[ ! -r "$txs_parquet" || ! -r "$tags_parquet" ]]; then
+        echo "Error: Missing transaction or tag files for height range $height_range" >&2
+        continue
+    fi
+
+    # Escape filenames for SQL safety
+    blocks_parquet_escaped="${blocks_parquet//\'/\'\'}"
+    txs_parquet_escaped="${txs_parquet//\'/\'\'}"
+    tags_parquet_escaped="${tags_parquet//\'/\'\'}"

     echo "Importing $blocks_parquet..."
-    clickhouse client --host "$clickhouse_host" --query="INSERT INTO staging_blocks FROM INFILE '$blocks_parquet' FORMAT Parquet;"
+    if ! clickhouse client --host "$clickhouse_host" --query="INSERT INTO staging_blocks FROM INFILE '$blocks_parquet_escaped' FORMAT Parquet"; then
+        echo "Error: Failed to import blocks file $blocks_parquet" >&2
+        continue
+    fi
-done
+done < <(find "$parquet_path" -type f -name "blocks*.parquet")

Line range hint 20-56: SQL query looks good with room for optimization

The SQL query is well-structured and handles the data transformation correctly. Consider adding an ORDER BY clause to potentially improve the groupArray performance if the order of tags is important.

Optional optimization:

   END AS tags
 FROM staging_transactions txs
 LEFT JOIN staging_tags tags ON txs.height = tags.height AND txs.id = tags.id
 JOIN staging_blocks blocks ON txs.height = blocks.height
-GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, tags.id
+GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, tags.id
+ORDER BY tags.tag_index
docker-compose.yaml (1)

152-153: Consider adding volume for logs

To aid in debugging and monitoring, consider adding a volume mapping for logs.

     volumes:
       - ${PARQUET_DATA_PATH:-./data/parquet}:/app/data/parquet
+      - ${CLICKHOUSE_AUTO_IMPORT_LOGS_PATH:-./logs/clickhouse-auto-import}:/app/logs
src/routes/ar-io.ts (1)

317-337: Consider adding metrics and progress tracking

For better operability, consider these improvements:

  1. Add metrics for monitoring (items deleted, time taken)
  2. Add progress tracking for long-running operations
  3. Add dry-run mode for safety

Here's how you might implement these features:

 arIoRouter.post(
   '/ar-io/admin/prune-stable-data-items',
   express.json(),
   async (req, res) => {
     try {
-      const { indexedAtThreshold } = req.body;
+      const { indexedAtThreshold, dryRun = false } = req.body;
 
       if (!Number.isInteger(indexedAtThreshold) || indexedAtThreshold < 0) {
         res
           .status(400)
           .send('Invalid indexedAtThreshold - must be a positive integer');
         return;
       }
 
+      const startTime = Date.now();
+      const result = await db.pruneStableDataItems(indexedAtThreshold, {
+        dryRun,
+        onProgress: (progress) => {
+          metrics.pruneProgress.set(progress);
+        },
+      });
-      await db.pruneStableDataItems(indexedAtThreshold);
+
+      const duration = Date.now() - startTime;
+      metrics.pruneOperationDuration.observe(duration);
+      metrics.itemsPruned.inc(result.itemsPruned);
+
       res.json({ message: 'Stable data items pruned successfully' });
     } catch (error: any) {
       res.status(500).send(error?.message);
     }
   },
 );
src/database/standalone-sqlite.ts (1)

2430-2435: Consider adding logging and batching.

The implementation is correct but could benefit from:

  1. Adding logging to track the number of deleted records
  2. Implementing batching for large deletions to prevent potential performance issues

Consider enhancing the implementation:

 pruneStableDataItems(indexedAtThreshold: number) {
+  const before = this.stmts.bundles.countStableDataItems.get().count;
   this.stmts.bundles.deleteStableDataItemsLessThanIndexedAt.run({
     indexed_at_threshold: indexedAtThreshold
   });
+  const after = this.stmts.bundles.countStableDataItems.get().count;
+  this.log.info('Pruned stable data items', {
+    deletedCount: before - after,
+    threshold: indexedAtThreshold
+  });
 }
scripts/clickhouse-auto-import (1)

36-36: Typo in comment: Correct 'inverals' to 'intervals'

In the comment on line 36, "inverals" should be "intervals" to fix the spelling error.

Apply this diff to fix the typo:

-# Align to inverals of 10,000
+# Align to intervals of 10,000
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 692c366 and 771083c.

📒 Files selected for processing (10)
  • Dockerfile.clickhouse-auto-import (1 hunks)
  • docker-compose.yaml (2 hunks)
  • migrations/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql (1 hunks)
  • migrations/down/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql (1 hunks)
  • scripts/clickhouse-auto-import (1 hunks)
  • scripts/clickhouse-import (3 hunks)
  • src/database/sql/bundles/cleanup.sql (1 hunks)
  • src/database/sql/bundles/stats.sql (3 hunks)
  • src/database/standalone-sqlite.ts (5 hunks)
  • src/routes/ar-io.ts (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • migrations/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql
  • migrations/down/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql
🔇 Additional comments (13)
src/database/sql/bundles/cleanup.sql (1)

16-17: Verify index existence for performance

The deletion query filters on the indexed_at column. To ensure optimal performance, verify that an index exists on this column in the stable_data_items table.

✅ Verification successful

Index exists for the indexed_at column

An index stable_data_items_indexed_at_idx has been created on the indexed_at column in the migration file migrations/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql. The deletion query will be able to use this index for optimal performance.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if an index exists on the indexed_at column of stable_data_items table

# Search for index creation statements
rg -A 2 "CREATE.*INDEX.*stable_data_items.*indexed_at"

# Search for table creation to verify indices
rg -A 10 "CREATE.*TABLE.*stable_data_items"

Length of output: 3963

Dockerfile.clickhouse-auto-import (1)

15-20: 🛠️ Refactor suggestion

Enhance script handling security

The current script handling could be improved for better security and reliability.

Consider these improvements:

-# Copy the auto-import script
+# Copy and verify import scripts
 COPY scripts/clickhouse-auto-import /app/scripts/
 COPY scripts/clickhouse-import /app/scripts/
 
-# Make scripts executable
-RUN chmod +x /app/scripts/clickhouse-auto-import /app/scripts/clickhouse-import
+# Set appropriate ownership and permissions
+RUN chown nobody:nogroup /app/scripts/clickhouse-* && \
+    chmod 500 /app/scripts/clickhouse-* && \
+    # Verify scripts exist and are executable
+    test -x /app/scripts/clickhouse-auto-import && \
+    test -x /app/scripts/clickhouse-import

Let's verify the referenced scripts exist in the repository:

✅ Verification successful

Scripts exist but ownership should be set to a non-root user

The scripts are present in the repository and contain appropriate functionality for ClickHouse data import automation. While the current chmod +x approach works, setting specific ownership and restrictive permissions would enhance security.

Recommended changes:

  • Set ownership to nobody:nogroup as these scripts don't require root privileges
  • Use more restrictive permissions (500) to prevent modification after container build
  • Add existence verification as an additional safety check
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify the existence and content of the import scripts
fd -t f "clickhouse-(auto-)?import$" -x cat {}

Length of output: 6239

src/database/sql/bundles/stats.sql (2)

17-21: LGTM! Well-structured addition of height metrics

The new height metrics follow consistent patterns:

  • Proper separation between new and stable items
  • Consistent use of IFNULL(-1) default values
  • Clear naming convention

17-21: Verify integration with pruning functionality

The new height metrics appear to support the pruning functionality mentioned in the AI summary. However, let's verify the height thresholds are consistently used across the codebase.

Also applies to: 32-36, 48-52

✅ Verification successful

Let me gather more information to verify the usage of these height metrics in the context of pruning functionality.


Let me check one more aspect to understand how these height metrics are used in the SQL queries and bundle processing.


Based on the comprehensive analysis of the codebase, I can now provide the final response:

Height metrics are correctly integrated with pruning functionality

The height metrics in the stats query are used to:

  • Track data chronology through min_new_height, max_new_height, min_stable_height, and max_stable_height
  • Support pruning via pruneStableDataItems which uses indexed_at_threshold for cleanup
  • Maintain data consistency through height-based indices and cleanup operations

The implementation aligns with the pruning functionality as evidenced by:

  • Height-based indices in schema definitions
  • Pruning logic in src/database/sql/bundles/cleanup.sql
  • Height tracking in both new and stable data items tables
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for height-based pruning logic in related files

# Look for height-based pruning conditions
rg -A 5 "height.*prune|prune.*height" --type ts

# Check for other uses of these metrics
rg -A 5 "(min|max)_(new|stable)_height" --type ts

Length of output: 835


Script:

#!/bin/bash
# Search for pruning-related code and configurations
rg -A 10 "prune" --type ts --type js

# Search for height-related configurations
rg -A 5 "height.*config|config.*height" --type ts --type js

# Look for bundle-related pruning logic
ast-grep --pattern 'function $FUNC($$$) {
  $$$
  height
  $$$
  bundle
  $$$
}'

Length of output: 3352


Script:

#!/bin/bash
# Check SQL files for height-related queries
rg -A 10 "height" --type sql

# Look for bundle processing logic
ast-grep --pattern 'async function $FUNC($$$) {
  $$$
  bundle
  $$$
}'

# Check for any configuration files that might define height thresholds
fd -e json -e yaml -e yml -x cat {}

Length of output: 103877

scripts/clickhouse-import (1)

58-60: 🛠️ Refactor suggestion

Combine truncates into a single transactional operation

The separate TRUNCATE commands could lead to an inconsistent state if one fails. Additionally, multiple network round trips impact performance.

Combine the operations:

-    clickhouse client --host "$clickhouse_host" --query="TRUNCATE TABLE staging_blocks"
-    clickhouse client --host "$clickhouse_host" --query="TRUNCATE TABLE staging_transactions"
-    clickhouse client --host "$clickhouse_host" --query="TRUNCATE TABLE staging_tags"
+    cat <<EOF | clickhouse client --host "$clickhouse_host"
+START TRANSACTION;
+TRUNCATE TABLE staging_blocks;
+TRUNCATE TABLE staging_transactions;
+TRUNCATE TABLE staging_tags;
+COMMIT;
+EOF

Likely invalid or redundant comment.

docker-compose.yaml (3)

42-42: LGTM: Parquet volume mapping follows established patterns

The new volume mapping for Parquet data follows the project's conventions and provides a sensible default path.


141-142: LGTM: Network configuration aligns with service requirements

Adding the clickhouse service to ar-io-network enables proper communication with other services, particularly the new auto-import service.


159-159: Review admin API key usage

The service is using the admin API key for authentication. Consider:

  1. Creating a separate, limited-scope API key for this service
  2. Documenting the required API permissions
src/routes/ar-io.ts (1)

26-26: LGTM!

The import is correctly placed and properly used in the new endpoint.

src/database/standalone-sqlite.ts (4)

374-377: LGTM: New height fields for data items tracking.

The new fields follow the existing pattern and provide valuable debugging information about data item heights in both stable and new states.


1138-1141: LGTM: Proper implementation of data item height fields.

The implementation correctly retrieves values from dataItemStats with appropriate fallback to -1, consistent with other height fields.


3099-3102: LGTM: Proper worker queue implementation.

The method correctly queues the pruning operation to the bundles worker with appropriate write access and return type.


3288-3291: LGTM: Consistent message handler implementation.

The new case follows the established pattern for message handling and maintains the code organization.

Comment on lines +22 to +23
# Environment variables
ENV ADMIN_API_KEY=""

⚠️ Potential issue

Document and validate environment variables

Setting ADMIN_API_KEY to an empty string could lead to security issues if not properly configured at runtime.

Consider adding validation and documentation:

-# Environment variables
-ENV ADMIN_API_KEY=""
+# Required environment variables
+# ADMIN_API_KEY: API key for authentication (required, non-empty)
+ENV ADMIN_API_KEY=
+
+# Validate environment variables at runtime
+COPY scripts/validate-env.sh /app/scripts/
+RUN chmod 500 /app/scripts/validate-env.sh
+ENTRYPOINT ["/app/scripts/validate-env.sh"]

Would you like me to help create the validate-env.sh script that checks for required environment variables at container startup?

Committable suggestion skipped: line range outside the PR's diff.
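
A minimal sketch of such a validation script, assuming its only job is to verify that ADMIN_API_KEY is non-empty before handing off to the container's main command:

#!/usr/bin/env bash
# Hypothetical validate-env.sh: fail fast when required variables are unset,
# then exec the command passed in via CMD.
set -euo pipefail

if [ -z "${ADMIN_API_KEY:-}" ]; then
    echo "Error: ADMIN_API_KEY must be set" >&2
    exit 1
fi

exec "$@"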

Comment on lines +25 to +26
# Run the auto-import script
CMD ["/app/scripts/clickhouse-auto-import"]

🛠️ Refactor suggestion

Switch to non-root user and use ENTRYPOINT

The container currently runs as root, which is a security risk. Additionally, the startup command could be more robust.

Consider these security improvements:

+# Create non-root user for running the container
+RUN useradd -r -s /bin/false clickhouse
+
+# Switch to non-root user
+USER clickhouse
+
 # Run the auto-import script
-CMD ["/app/scripts/clickhouse-auto-import"]
+ENTRYPOINT ["/app/scripts/validate-env.sh"]
+CMD ["/app/scripts/clickhouse-auto-import"]

This change:

  1. Creates a dedicated non-root user
  2. Uses ENTRYPOINT for environment validation
  3. Maintains CMD for the primary process

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +4 to +9
RUN apt-get update && apt-get install -y \
curl \
jq \
&& rm -rf /var/lib/apt/lists/* \
&& curl https://clickhouse.com/ | sh \
&& mv clickhouse /usr/local/bin/

⚠️ Potential issue

Critical: Unsafe installation method and missing version controls

The current installation method has several security concerns:

  1. Piping curl directly to shell is unsafe as it executes unverified code
  2. No version pinning for ClickHouse or system packages
  3. No checksum verification of downloaded content

Consider this safer alternative:

 RUN apt-get update && apt-get install -y \
-    curl \
-    jq \
+    curl=7.74.0-1.3+deb11u11 \
+    jq=1.6-2.1 \
     && rm -rf /var/lib/apt/lists/* \
-    && curl https://clickhouse.com/ | sh \
-    && mv clickhouse /usr/local/bin/
+    && CLICKHOUSE_VERSION="23.8.2.7" \
+    && CLICKHOUSE_PACKAGE="clickhouse-common-static-${CLICKHOUSE_VERSION}.tgz" \
+    && curl -fsSL "https://packages.clickhouse.com/tgz/${CLICKHOUSE_PACKAGE}" -o clickhouse.tgz \
+    && echo "${EXPECTED_HASH} clickhouse.tgz" | sha256sum -c \
+    && tar xvf clickhouse.tgz \
+    && cp ./clickhouse /usr/local/bin/ \
+    && rm -rf clickhouse.tgz

Also consider:

  1. Using multi-stage builds to reduce final image size
  2. Adding HEALTHCHECK instruction to verify ClickHouse installation

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +32 to +36
null AS max_stable_indexed_at,
MIN(height) AS min_new_height,
null AS min_stable_height,
MAX(height) AS max_new_height,
null AS max_stable_height

🛠️ Refactor suggestion

Standardize null vs -1 for unavailable data

There's an inconsistency in how unavailable data is represented between the subqueries:

  • new_data_items uses null for stable metrics
  • stable_data_items uses -1 for new metrics

This inconsistency could lead to confusion or bugs. Consider standardizing to either null or -1.

  SELECT
    COUNT(*) AS data_item_count,
    SUM(
      CASE
        WHEN parent_id != root_transaction_id THEN 1
        ELSE 0
      END
    ) AS nested_data_item_count,
    IFNULL(MAX(indexed_at), -1) AS max_new_indexed_at,
-   null AS max_stable_indexed_at,
+   -1 AS max_stable_indexed_at,
    MIN(height) AS min_new_height,
-   null AS min_stable_height,
+   -1 AS min_stable_height,
    MAX(height) AS max_new_height,
-   null AS max_stable_height
+   -1 AS max_stable_height

And similarly for the stable_data_items subquery:

    -1 AS max_new_indexed_at,
    IFNULL(MAX(indexed_at), -1) AS max_stable_indexed_at,
-   null AS min_new_height,
+   -1 AS min_new_height,
    MIN(height) AS min_stable_height,
-   null AS max_new_height,
+   -1 AS max_new_height,
    MAX(height) AS max_stable_height

Also applies to: 48-52

Comment on lines +3 to 4
clickhouse_host=${CLICKHOUSE_HOST:-localhost}
parquet_path=${PARQUET_PATH:-data/parquet}

🛠️ Refactor suggestion

Add connection validation and error handling

While the host configuration is good, we should validate the connection before proceeding with imports to fail fast and provide clear error messages.

Add this validation after the variable declarations:

 clickhouse_host=${CLICKHOUSE_HOST:-localhost}
 parquet_path=${PARQUET_PATH:-data/parquet}
+
+# Validate ClickHouse connection
+if ! clickhouse client --host "$clickhouse_host" --query "SELECT 1" >/dev/null 2>&1; then
+    echo "Error: Unable to connect to ClickHouse at $clickhouse_host" >&2
+    exit 1
+fi

done

# Import Parquet files
./scripts/clickhouse-import

⚠️ Potential issue

Add error handling for the ClickHouse import script execution

The script ./scripts/clickhouse-import is executed without checking for errors. If the import fails, subsequent steps may proceed incorrectly. Incorporate error handling to ensure failures are detected.

Apply this diff to check the exit status:

-./scripts/clickhouse-import
+if ! ./scripts/clickhouse-import; then
+    echo "ClickHouse import failed"
+    exit 1
+fi

./scripts/clickhouse-import

# Move processed files to imported directory
mv "$parquet_dir"/*.parquet "$imported_dir/"

⚠️ Potential issue

Handle potential failures when moving Parquet files

The mv command may fail if there are no Parquet files to move or due to filesystem issues. Add error handling to manage such failures appropriately.

Apply this diff to add error checking:

-mv "$parquet_dir"/*.parquet "$imported_dir/"
+if ! mv "$parquet_dir"/*.parquet "$imported_dir/"; then
+    echo "Failed to move Parquet files to imported directory"
+    exit 1
+fi

Comment on lines +84 to +87
curl -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/prune-stable-data-items" \
-H "Authorization: Bearer $ADMIN_API_KEY" \
-H "Content-Type: application/json" \
-d "{\"indexedAtThreshold\": $max_indexed_at}"

⚠️ Potential issue

Add error handling for pruning stable data items

The curl command used to prune stable data items does not check for HTTP errors. If the request fails, the script will proceed without detecting the failure. Add the -f flag and check the exit status to handle errors.

Apply this diff to add error checking:

-curl -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/prune-stable-data-items" \
+if ! curl -f -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/prune-stable-data-items" \
     -H "Authorization: Bearer $ADMIN_API_KEY" \
     -H "Content-Type: application/json" \
     -d "{\"indexedAtThreshold\": $max_indexed_at}"; then
+    echo "Pruning stable data items failed"
+    exit 1
+fi

Comment on lines +45 to +53
curl -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet" \
-H "Authorization: Bearer $ADMIN_API_KEY" \
-H "Content-Type: application/json" \
-d "{
\"outputDir\": \"$parquet_dir\",
\"startHeight\": $current_height,
\"endHeight\": $end_height,
\"maxFileRows\": $max_rows_per_file
}"

⚠️ Potential issue

Add error handling for the curl command when exporting Parquet files

The curl command used to initiate the export process does not check for HTTP errors. If the request fails, the script will proceed without detecting the failure. Consider adding the -f flag to curl and checking the exit status to handle errors appropriately.

Apply this diff to add error checking:

-curl -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet" \
+if ! curl -f -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet" \
     -H "Authorization: Bearer $ADMIN_API_KEY" \
     -H "Content-Type: application/json" \
     -d "{
         \"outputDir\": \"$parquet_dir\",
         \"startHeight\": $current_height,
         \"endHeight\": $end_height,
         \"maxFileRows\": $max_rows_per_file
     }"; then
+    echo "Export request failed"
+    exit 1
+fi

Comment on lines +56 to +75
while true; do
if ! status=$(curl -s -f -H "Authorization: Bearer $ADMIN_API_KEY" "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet/status"); then
echo "Failed to get export status"
rm -f "$parquet_dir"/*.parquet
exit 1
fi

export_status=$(echo "$status" | jq -r '.status')
if [ "$export_status" = "completed" ]; then
break
elif [ "$export_status" = "errored" ]; then
error=$(echo "$status" | jq -r '.error')
echo "Export failed: $error"
rm -f "$parquet_dir"/*.parquet
exit 1
fi

echo "Waiting for export to complete..."
sleep 10
done

🛠️ Refactor suggestion

Implement a timeout to avoid infinite wait during export completion

The inner loop waiting for the export to complete lacks a timeout mechanism. If the export process hangs indefinitely, the script could enter an infinite loop. Introduce a maximum retry count to prevent this scenario.

Here's an example modification:

+# Set maximum number of retries (e.g., 30 retries with 10-second intervals = 5 minutes)
+max_retries=30
+retry_count=0

 # Wait for the export to complete
 while true; do
     if ! status=$(curl -s -f -H "Authorization: Bearer $ADMIN_API_KEY" "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet/status"); then
         echo "Failed to get export status"
         rm -f "$parquet_dir"/*.parquet
         exit 1
     fi

     export_status=$(echo "$status" | jq -r '.status')
     if [ "$export_status" = "completed" ]; then
         break
     elif [ "$export_status" = "errored" ]; then
         error=$(echo "$status" | jq -r '.error')
         echo "Export failed: $error"
         rm -f "$parquet_dir"/*.parquet
         exit 1
     fi

     echo "Waiting for export to complete..."
     sleep 10
+    retry_count=$((retry_count + 1))
+    if [ "$retry_count" -ge "$max_retries" ]; then
+        echo "Export did not complete within expected time"
+        rm -f "$parquet_dir"/*.parquet
+        exit 1
+    fi
 done
