feat: ClickHouse import automation PE-7137 #241

Draft
wants to merge 4 commits into base: develop
Conversation

djwhitt (Collaborator) commented Nov 27, 2024

No description provided.

…E-7137

Adding min/max data item heights to debug endpoint output both for
general utility and to track what height ranges are available to export
as Parquet.
This index will be used to quickly determine what stable data items to
delete after Parquet exports and imports complete.
This will be used by the Parquet => ClickHouse export automation to clean up
data items that have already been exported.
…ckHouse import PE-7137

Adds a clickhouse-auto-import script that loops forever exporting
Parquet from the SQLite DB, importing it into ClickHouse, moving it to
another directory for potential archival, and pruning old data from
SQLite.

When using Docker Compose, the script is run automatically in a separate
service when the 'clickhouse' profile is in use.
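
For reference, the stack can be brought up with that profile enabled using standard Docker Compose flags (service name as added in docker-compose.yaml in this PR; an illustrative sketch, not part of the change itself):

# Start the gateway together with the ClickHouse services; the 'clickhouse'
# profile activates the clickhouse-auto-import service added in this PR.
docker compose --profile clickhouse up -d

# Follow the auto-import service logs to watch the export/import loop.
docker compose logs -f clickhouse-auto-import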

Currently the script is written in bash. In the future we may integrate
it into the core service TypeScript codebase, but having a standalone
script initially allows for faster iteration.
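
To make the flow concrete, here is a rough outline of such a loop. This is an illustrative sketch only, not the script from this PR; the directory variables are assumptions based on the defaults discussed in the review below.

#!/usr/bin/env bash
# Illustrative outline of an auto-import loop (not the actual
# scripts/clickhouse-auto-import from this PR).
set -euo pipefail

parquet_dir=${PARQUET_PATH:-data/parquet}   # where exports land (assumed default)
imported_dir="$parquet_dir/imported"        # archival directory (assumed)

while true; do
    # 1. Request a Parquet export of a stable height range via the core
    #    service admin API and wait for it to complete (see comments below).

    # 2. Load the exported Parquet files into ClickHouse.
    ./scripts/clickhouse-import

    # 3. Keep processed files around for potential archival.
    mv "$parquet_dir"/*.parquet "$imported_dir/"

    # 4. Prune the exported data items from SQLite via the admin API,
    #    then pause before the next iteration.
    sleep 10
done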
@djwhitt djwhitt marked this pull request as draft November 27, 2024 16:20

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 36.36364% with 14 lines in your changes missing coverage. Please review.

Project coverage is 70.08%. Comparing base (c40ed8e) to head (771083c).
Report is 9 commits behind head on develop.

Files with missing lines: src/database/standalone-sqlite.ts — patch 36.36%, 14 lines missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #241      +/-   ##
===========================================
- Coverage    70.17%   70.08%   -0.10%     
===========================================
  Files           33       33              
  Lines         8211     8233      +22     
  Branches       455      460       +5     
===========================================
+ Hits          5762     5770       +8     
- Misses        2447     2461      +14     
  Partials         2        2              

☔ View full report in Codecov by Sentry.

coderabbitai bot (Contributor) commented Nov 27, 2024

📝 Walkthrough


This pull request introduces a new Dockerfile for setting up a ClickHouse auto-import environment, along with modifications to the docker-compose.yaml to add a new service for this functionality. It also includes the creation of a new SQL index, updates to existing SQL commands, and the addition of scripts for data import automation. Furthermore, it enhances the database management capabilities by adding methods for pruning stable data items and introducing a new API endpoint for this operation.
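
As a usage illustration of that endpoint (host, port, and threshold value are placeholders; only the path and request body field come from this PR):

# Hypothetical invocation of the new pruning endpoint; adjust host, port,
# and ADMIN_API_KEY for your deployment.
curl -X POST "http://localhost:4000/ar-io/admin/prune-stable-data-items" \
    -H "Authorization: Bearer $ADMIN_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"indexedAtThreshold": 1732600000}'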

Changes

  • Dockerfile.clickhouse-auto-import: New Dockerfile for the ClickHouse auto-import setup; installs ClickHouse and copies the import scripts.
  • docker-compose.yaml: Adds the clickhouse-auto-import service, a new volume mapping on the core service, and network configuration changes for the clickhouse service.
  • migrations/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql: Creates the stable_data_items_indexed_at_idx index on the indexed_at column of stable_data_items.
  • migrations/down/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql: Drops the stable_data_items_indexed_at_idx index.
  • scripts/clickhouse-auto-import: New script automating Parquet export from the API and import into ClickHouse.
  • scripts/clickhouse-import: Adds a clickhouse_host variable for connecting to the specified ClickHouse host.
  • src/database/sql/bundles/cleanup.sql: New SQL command deleting records from stable_data_items where indexed_at is below a threshold.
  • src/database/sql/bundles/stats.sql: Extends the stats queries with minimum and maximum height metrics for data items.
  • src/database/standalone-sqlite.ts: Adds a pruneStableDataItems method for deleting stable data items by indexed timestamp.
  • src/routes/ar-io.ts: Adds the POST /ar-io/admin/prune-stable-data-items endpoint, with input validation.

Possibly related PRs

Suggested reviewers

  • karlprieb

Warning

There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure.

🔧 eslint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

error @permaweb/[email protected]: The engine "yarn" is incompatible with this module. Expected version "please-use-npm". Got "1.22.22"
error Found incompatible module.



coderabbitai bot (Contributor) left a comment


Actionable comments posted: 14

🧹 Outside diff range and nitpick comments (9)
src/database/sql/bundles/cleanup.sql (1)

16-17: Consider handling NULL values explicitly

The current WHERE clause might delete rows where indexed_at is NULL. If this is not intended, consider adding an explicit NULL check.

DELETE FROM stable_data_items
-WHERE indexed_at < @indexed_at_threshold
+WHERE indexed_at IS NOT NULL AND indexed_at < @indexed_at_threshold
Dockerfile.clickhouse-auto-import (2)

11-13: Consider documenting directory structure and setting explicit permissions

While the directory structure is created, its purpose and permissions should be more clearly defined.

Consider this enhancement:

 # Create necessary directories
 WORKDIR /app
-RUN mkdir -p data/parquet/imported
+# Create directories with appropriate permissions for imported data
+RUN mkdir -p data/parquet/imported && \
+    chown -R nobody:nogroup data && \
+    chmod -R 755 data

Also, consider adding a comment explaining the purpose of each directory in the structure.


1-26: Consider implementing health checks and logging strategy

While the basic functionality is in place, the Dockerfile could benefit from additional operational features:

  1. Add HEALTHCHECK instruction to monitor the import process
  2. Implement structured logging
  3. Consider adding a volume for persistent data
  4. Add container labels for metadata

Example health check implementation:

+# Add monitoring capabilities
+HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:8123/ping || exit 1
+
+# Add metadata
+LABEL maintainer="AR.IO Team" \
+      description="ClickHouse auto-import service" \
+      version="1.0"
scripts/clickhouse-import (2)

Line range hint 6-18: Fix potential security and reliability issues in file handling

Several issues need attention:

  1. Using ls for filename matching is unsafe
  2. Missing file existence validation
  3. SQL injection vulnerability through filenames
  4. No error handling for failed imports

Apply these improvements:

-for blocks_parquet in "$parquet_path/blocks"*.parquet; do
+# Use find instead of shell globbing for safer file handling
+while IFS= read -r blocks_parquet; do
+    # Validate file exists and is readable
+    if [[ ! -r "$blocks_parquet" ]]; then
+        echo "Error: Cannot read $blocks_parquet" >&2
+        continue
+    fi
+
     height_range=$(basename "$blocks_parquet" | sed 's/blocks-//;s/-rowCount:[0-9]*\.parquet//')
-    txs_parquet=$(ls data/parquet/transactions-"$height_range"-*.parquet)
-    tags_parquet=$(ls data/parquet/tags-"$height_range"-*.parquet)
+    # Use find for safer file matching
+    txs_parquet=$(find "$parquet_path" -type f -name "transactions-${height_range}-*.parquet" -print -quit)
+    tags_parquet=$(find "$parquet_path" -type f -name "tags-${height_range}-*.parquet" -print -quit)
+
+    # Validate all required files exist
+    if [[ ! -r "$txs_parquet" || ! -r "$tags_parquet" ]]; then
+        echo "Error: Missing transaction or tag files for height range $height_range" >&2
+        continue
+    fi
+
+    # Escape filenames for SQL safety
+    blocks_parquet_escaped="${blocks_parquet//\'/\'\'}"
+    txs_parquet_escaped="${txs_parquet//\'/\'\'}"
+    tags_parquet_escaped="${tags_parquet//\'/\'\'}"

     echo "Importing $blocks_parquet..."
-    clickhouse client --host "$clickhouse_host" --query="INSERT INTO staging_blocks FROM INFILE '$blocks_parquet' FORMAT Parquet;"
+    if ! clickhouse client --host "$clickhouse_host" --query="INSERT INTO staging_blocks FROM INFILE '$blocks_parquet_escaped' FORMAT Parquet"; then
+        echo "Error: Failed to import blocks file $blocks_parquet" >&2
+        continue
+    fi
-done
+done < <(find "$parquet_path" -type f -name "blocks*.parquet")

Line range hint 20-56: SQL query looks good with room for optimization

The SQL query is well-structured and handles the data transformation correctly. Consider adding an ORDER BY clause to potentially improve the groupArray performance if the order of tags is important.

Optional optimization:

   END AS tags
 FROM staging_transactions txs
 LEFT JOIN staging_tags tags ON txs.height = tags.height AND txs.id = tags.id
 JOIN staging_blocks blocks ON txs.height = blocks.height
-GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, tags.id
+GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, tags.id
+ORDER BY tags.tag_index
docker-compose.yaml (1)

152-153: Consider adding volume for logs

To aid in debugging and monitoring, consider adding a volume mapping for logs.

     volumes:
       - ${PARQUET_DATA_PATH:-./data/parquet}:/app/data/parquet
+      - ${CLICKHOUSE_AUTO_IMPORT_LOGS_PATH:-./logs/clickhouse-auto-import}:/app/logs
src/routes/ar-io.ts (1)

317-337: Consider adding metrics and progress tracking

For better operability, consider these improvements:

  1. Add metrics for monitoring (items deleted, time taken)
  2. Add progress tracking for long-running operations
  3. Add dry-run mode for safety

Here's how you might implement these features:

 arIoRouter.post(
   '/ar-io/admin/prune-stable-data-items',
   express.json(),
   async (req, res) => {
     try {
-      const { indexedAtThreshold } = req.body;
+      const { indexedAtThreshold, dryRun = false } = req.body;
 
       if (!Number.isInteger(indexedAtThreshold) || indexedAtThreshold < 0) {
         res
           .status(400)
           .send('Invalid indexedAtThreshold - must be a positive integer');
         return;
       }
 
+      const startTime = Date.now();
+      const result = await db.pruneStableDataItems(indexedAtThreshold, {
+        dryRun,
+        onProgress: (progress) => {
+          metrics.pruneProgress.set(progress);
+        },
+      });
-      await db.pruneStableDataItems(indexedAtThreshold);
+
+      const duration = Date.now() - startTime;
+      metrics.pruneOperationDuration.observe(duration);
+      metrics.itemsPruned.inc(result.itemsPruned);
+
       res.json({ message: 'Stable data items pruned successfully' });
     } catch (error: any) {
       res.status(500).send(error?.message);
     }
   },
 );
src/database/standalone-sqlite.ts (1)

2430-2435: Consider adding logging and batching.

The implementation is correct but could benefit from:

  1. Adding logging to track the number of deleted records
  2. Implementing batching for large deletions to prevent potential performance issues

Consider enhancing the implementation:

 pruneStableDataItems(indexedAtThreshold: number) {
+  const before = this.stmts.bundles.countStableDataItems.get().count;
   this.stmts.bundles.deleteStableDataItemsLessThanIndexedAt.run({
     indexed_at_threshold: indexedAtThreshold
   });
+  const after = this.stmts.bundles.countStableDataItems.get().count;
+  this.log.info('Pruned stable data items', {
+    deletedCount: before - after,
+    threshold: indexedAtThreshold
+  });
 }
scripts/clickhouse-auto-import (1)

36-36: Typo in comment: Correct 'inverals' to 'intervals'

In the comment on line 36, "inverals" should be "intervals" to fix the spelling error.

Apply this diff to fix the typo:

-# Align to inverals of 10,000
+# Align to intervals of 10,000
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 692c366 and 771083c.

📒 Files selected for processing (10)
  • Dockerfile.clickhouse-auto-import (1 hunks)
  • docker-compose.yaml (2 hunks)
  • migrations/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql (1 hunks)
  • migrations/down/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql (1 hunks)
  • scripts/clickhouse-auto-import (1 hunks)
  • scripts/clickhouse-import (3 hunks)
  • src/database/sql/bundles/cleanup.sql (1 hunks)
  • src/database/sql/bundles/stats.sql (3 hunks)
  • src/database/standalone-sqlite.ts (5 hunks)
  • src/routes/ar-io.ts (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • migrations/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql
  • migrations/down/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql
🔇 Additional comments (13)
src/database/sql/bundles/cleanup.sql (1)

16-17: Verify index existence for performance

The deletion query filters on the indexed_at column. To ensure optimal performance, verify that an index exists on this column in the stable_data_items table.

✅ Verification successful

Index exists for the indexed_at column

An index stable_data_items_indexed_at_idx has been created on the indexed_at column in the migration file migrations/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql. The deletion query will be able to use this index for optimal performance.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if an index exists on the indexed_at column of stable_data_items table

# Search for index creation statements
rg -A 2 "CREATE.*INDEX.*stable_data_items.*indexed_at"

# Search for table creation to verify indices
rg -A 10 "CREATE.*TABLE.*stable_data_items"

Length of output: 3963

Dockerfile.clickhouse-auto-import (1)

15-20: 🛠️ Refactor suggestion

Enhance script handling security

The current script handling could be improved for better security and reliability.

Consider these improvements:

-# Copy the auto-import script
+# Copy and verify import scripts
 COPY scripts/clickhouse-auto-import /app/scripts/
 COPY scripts/clickhouse-import /app/scripts/
 
-# Make scripts executable
-RUN chmod +x /app/scripts/clickhouse-auto-import /app/scripts/clickhouse-import
+# Set appropriate ownership and permissions
+RUN chown nobody:nogroup /app/scripts/clickhouse-* && \
+    chmod 500 /app/scripts/clickhouse-* && \
+    # Verify scripts exist and are executable
+    test -x /app/scripts/clickhouse-auto-import && \
+    test -x /app/scripts/clickhouse-import

Let's verify the referenced scripts exist in the repository:

✅ Verification successful

Scripts exist but ownership should be set to a non-root user

The scripts are present in the repository and contain appropriate functionality for ClickHouse data import automation. While the current chmod +x approach works, setting specific ownership and restrictive permissions would enhance security.

Recommended changes:

  • Set ownership to nobody:nogroup as these scripts don't require root privileges
  • Use more restrictive permissions (500) to prevent modification after container build
  • Add existence verification as an additional safety check
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify the existence and content of the import scripts
fd -t f "clickhouse-(auto-)?import$" -x cat {}

Length of output: 6239

src/database/sql/bundles/stats.sql (2)

17-21: LGTM! Well-structured addition of height metrics

The new height metrics follow consistent patterns:

  • Proper separation between new and stable items
  • Consistent use of IFNULL(-1) default values
  • Clear naming convention

17-21: Verify integration with pruning functionality

The new height metrics appear to support the pruning functionality mentioned in the AI summary. However, let's verify the height thresholds are consistently used across the codebase.

Also applies to: 32-36, 48-52

✅ Verification successful

Let me gather more information to verify the usage of these height metrics in the context of pruning functionality.


Let me check one more aspect to understand how these height metrics are used in the SQL queries and bundle processing.


Based on the comprehensive analysis of the codebase, I can now provide the final response:

Height metrics are correctly integrated with pruning functionality

The height metrics in the stats query are used to:

  • Track data chronology through min_new_height, max_new_height, min_stable_height, and max_stable_height
  • Support pruning via pruneStableDataItems which uses indexed_at_threshold for cleanup
  • Maintain data consistency through height-based indices and cleanup operations

The implementation aligns with the pruning functionality as evidenced by:

  • Height-based indices in schema definitions
  • Pruning logic in src/database/sql/bundles/cleanup.sql
  • Height tracking in both new and stable data items tables
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for height-based pruning logic in related files

# Look for height-based pruning conditions
rg -A 5 "height.*prune|prune.*height" --type ts

# Check for other uses of these metrics
rg -A 5 "(min|max)_(new|stable)_height" --type ts

Length of output: 835


Script:

#!/bin/bash
# Search for pruning-related code and configurations
rg -A 10 "prune" --type ts --type js

# Search for height-related configurations
rg -A 5 "height.*config|config.*height" --type ts --type js

# Look for bundle-related pruning logic
ast-grep --pattern 'function $FUNC($$$) {
  $$$
  height
  $$$
  bundle
  $$$
}'

Length of output: 3352


Script:

#!/bin/bash
# Check SQL files for height-related queries
rg -A 10 "height" --type sql

# Look for bundle processing logic
ast-grep --pattern 'async function $FUNC($$$) {
  $$$
  bundle
  $$$
}'

# Check for any configuration files that might define height thresholds
fd -e json -e yaml -e yml -x cat {}

Length of output: 103877

scripts/clickhouse-import (1)

58-60: 🛠️ Refactor suggestion

Combine truncates into a single transactional operation

The separate TRUNCATE commands could lead to an inconsistent state if one fails. Additionally, multiple network round trips impact performance.

Combine the operations:

-    clickhouse client --host "$clickhouse_host" --query="TRUNCATE TABLE staging_blocks"
-    clickhouse client --host "$clickhouse_host" --query="TRUNCATE TABLE staging_transactions"
-    clickhouse client --host "$clickhouse_host" --query="TRUNCATE TABLE staging_tags"
+    cat <<EOF | clickhouse client --host "$clickhouse_host"
+START TRANSACTION;
+TRUNCATE TABLE staging_blocks;
+TRUNCATE TABLE staging_transactions;
+TRUNCATE TABLE staging_tags;
+COMMIT;
+EOF

Likely invalid or redundant comment.

docker-compose.yaml (3)

42-42: LGTM: Parquet volume mapping follows established patterns

The new volume mapping for Parquet data follows the project's conventions and provides a sensible default path.


141-142: LGTM: Network configuration aligns with service requirements

Adding the clickhouse service to ar-io-network enables proper communication with other services, particularly the new auto-import service.


159-159: Review admin API key usage

The service is using the admin API key for authentication. Consider:

  1. Creating a separate, limited-scope API key for this service
  2. Documenting the required API permissions
src/routes/ar-io.ts (1)

26-26: LGTM!

The import is correctly placed and properly used in the new endpoint.

src/database/standalone-sqlite.ts (4)

374-377: LGTM: New height fields for data items tracking.

The new fields follow the existing pattern and provide valuable debugging information about data item heights in both stable and new states.


1138-1141: LGTM: Proper implementation of data item height fields.

The implementation correctly retrieves values from dataItemStats with appropriate fallback to -1, consistent with other height fields.


3099-3102: LGTM: Proper worker queue implementation.

The method correctly queues the pruning operation to the bundles worker with appropriate write access and return type.


3288-3291: LGTM: Consistent message handler implementation.

The new case follows the established pattern for message handling and maintains the code organization.

Comment on lines +22 to +23
# Environment variables
ENV ADMIN_API_KEY=""

⚠️ Potential issue

Document and validate environment variables

Setting ADMIN_API_KEY to an empty string could lead to security issues if not properly configured at runtime.

Consider adding validation and documentation:

-# Environment variables
-ENV ADMIN_API_KEY=""
+# Required environment variables
+# ADMIN_API_KEY: API key for authentication (required, non-empty)
+ENV ADMIN_API_KEY=
+
+# Validate environment variables at runtime
+COPY scripts/validate-env.sh /app/scripts/
+RUN chmod 500 /app/scripts/validate-env.sh
+ENTRYPOINT ["/app/scripts/validate-env.sh"]

Would you like me to help create the validate-env.sh script that checks for required environment variables at container startup?

Committable suggestion skipped: line range outside the PR's diff.
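
A minimal sketch of such a validation script, assuming its only job is to verify that ADMIN_API_KEY is non-empty before handing off to the container's main command:

#!/usr/bin/env bash
# Hypothetical validate-env.sh: fail fast when required variables are unset,
# then exec the command passed in via CMD.
set -euo pipefail

if [ -z "${ADMIN_API_KEY:-}" ]; then
    echo "Error: ADMIN_API_KEY must be set" >&2
    exit 1
fi

exec "$@"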

Comment on lines +25 to +26
# Run the auto-import script
CMD ["/app/scripts/clickhouse-auto-import"]

🛠️ Refactor suggestion

Switch to non-root user and use ENTRYPOINT

The container currently runs as root, which is a security risk. Additionally, the startup command could be more robust.

Consider these security improvements:

+# Create non-root user for running the container
+RUN useradd -r -s /bin/false clickhouse
+
+# Switch to non-root user
+USER clickhouse
+
 # Run the auto-import script
-CMD ["/app/scripts/clickhouse-auto-import"]
+ENTRYPOINT ["/app/scripts/validate-env.sh"]
+CMD ["/app/scripts/clickhouse-auto-import"]

This change:

  1. Creates a dedicated non-root user
  2. Uses ENTRYPOINT for environment validation
  3. Maintains CMD for the primary process

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +4 to +9
RUN apt-get update && apt-get install -y \
curl \
jq \
&& rm -rf /var/lib/apt/lists/* \
&& curl https://clickhouse.com/ | sh \
&& mv clickhouse /usr/local/bin/

⚠️ Potential issue

Critical: Unsafe installation method and missing version controls

The current installation method has several security concerns:

  1. Piping curl directly to shell is unsafe as it executes unverified code
  2. No version pinning for ClickHouse or system packages
  3. No checksum verification of downloaded content

Consider this safer alternative:

 RUN apt-get update && apt-get install -y \
-    curl \
-    jq \
+    curl=7.74.0-1.3+deb11u11 \
+    jq=1.6-2.1 \
     && rm -rf /var/lib/apt/lists/* \
-    && curl https://clickhouse.com/ | sh \
-    && mv clickhouse /usr/local/bin/
+    && CLICKHOUSE_VERSION="23.8.2.7" \
+    && CLICKHOUSE_PACKAGE="clickhouse-common-static-${CLICKHOUSE_VERSION}.tgz" \
+    && curl -fsSL "https://packages.clickhouse.com/tgz/${CLICKHOUSE_PACKAGE}" -o clickhouse.tgz \
+    && echo "${EXPECTED_HASH} clickhouse.tgz" | sha256sum -c \
+    && tar xvf clickhouse.tgz \
+    && cp ./clickhouse /usr/local/bin/ \
+    && rm -rf clickhouse.tgz

Also consider:

  1. Using multi-stage builds to reduce final image size
  2. Adding HEALTHCHECK instruction to verify ClickHouse installation

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +32 to +36
null AS max_stable_indexed_at,
MIN(height) AS min_new_height,
null AS min_stable_height,
MAX(height) AS max_new_height,
null AS max_stable_height

🛠️ Refactor suggestion

Standardize null vs -1 for unavailable data

There's an inconsistency in how unavailable data is represented between the subqueries:

  • new_data_items uses null for stable metrics
  • stable_data_items uses -1 for new metrics

This inconsistency could lead to confusion or bugs. Consider standardizing to either null or -1.

  SELECT
    COUNT(*) AS data_item_count,
    SUM(
      CASE
        WHEN parent_id != root_transaction_id THEN 1
        ELSE 0
      END
    ) AS nested_data_item_count,
    IFNULL(MAX(indexed_at), -1) AS max_new_indexed_at,
-   null AS max_stable_indexed_at,
+   -1 AS max_stable_indexed_at,
    MIN(height) AS min_new_height,
-   null AS min_stable_height,
+   -1 AS min_stable_height,
    MAX(height) AS max_new_height,
-   null AS max_stable_height
+   -1 AS max_stable_height

And similarly for the stable_data_items subquery:

    -1 AS max_new_indexed_at,
    IFNULL(MAX(indexed_at), -1) AS max_stable_indexed_at,
-   null AS min_new_height,
+   -1 AS min_new_height,
    MIN(height) AS min_stable_height,
-   null AS max_new_height,
+   -1 AS max_new_height,
    MAX(height) AS max_stable_height

Also applies to: 48-52

Comment on lines +3 to 4
clickhouse_host=${CLICKHOUSE_HOST:-localhost}
parquet_path=${PARQUET_PATH:-data/parquet}

🛠️ Refactor suggestion

Add connection validation and error handling

While the host configuration is good, we should validate the connection before proceeding with imports to fail fast and provide clear error messages.

Add this validation after the variable declarations:

 clickhouse_host=${CLICKHOUSE_HOST:-localhost}
 parquet_path=${PARQUET_PATH:-data/parquet}
+
+# Validate ClickHouse connection
+if ! clickhouse client --host "$clickhouse_host" --query "SELECT 1" >/dev/null 2>&1; then
+    echo "Error: Unable to connect to ClickHouse at $clickhouse_host" >&2
+    exit 1
+fi

done

# Import Parquet files
./scripts/clickhouse-import

⚠️ Potential issue

Add error handling for the ClickHouse import script execution

The script ./scripts/clickhouse-import is executed without checking for errors. If the import fails, subsequent steps may proceed incorrectly. Incorporate error handling to ensure failures are detected.

Apply this diff to check the exit status:

-./scripts/clickhouse-import
+if ! ./scripts/clickhouse-import; then
+    echo "ClickHouse import failed"
+    exit 1
+fi

./scripts/clickhouse-import

# Move processed files to imported directory
mv "$parquet_dir"/*.parquet "$imported_dir/"

⚠️ Potential issue

Handle potential failures when moving Parquet files

The mv command may fail if there are no Parquet files to move or due to filesystem issues. Add error handling to manage such failures appropriately.

Apply this diff to add error checking:

-mv "$parquet_dir"/*.parquet "$imported_dir/"
+if ! mv "$parquet_dir"/*.parquet "$imported_dir/"; then
+    echo "Failed to move Parquet files to imported directory"
+    exit 1
+fi

Comment on lines +84 to +87
curl -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/prune-stable-data-items" \
-H "Authorization: Bearer $ADMIN_API_KEY" \
-H "Content-Type: application/json" \
-d "{\"indexedAtThreshold\": $max_indexed_at}"

⚠️ Potential issue

Add error handling for pruning stable data items

The curl command used to prune stable data items does not check for HTTP errors. If the request fails, the script will proceed without detecting the failure. Add the -f flag and check the exit status to handle errors.

Apply this diff to add error checking:

-curl -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/prune-stable-data-items" \
+if ! curl -f -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/prune-stable-data-items" \
     -H "Authorization: Bearer $ADMIN_API_KEY" \
     -H "Content-Type: application/json" \
     -d "{\"indexedAtThreshold\": $max_indexed_at}"; then
+    echo "Pruning stable data items failed"
+    exit 1
+fi

Comment on lines +45 to +53
curl -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet" \
-H "Authorization: Bearer $ADMIN_API_KEY" \
-H "Content-Type: application/json" \
-d "{
\"outputDir\": \"$parquet_dir\",
\"startHeight\": $current_height,
\"endHeight\": $end_height,
\"maxFileRows\": $max_rows_per_file
}"

⚠️ Potential issue

Add error handling for the curl command when exporting Parquet files

The curl command used to initiate the export process does not check for HTTP errors. If the request fails, the script will proceed without detecting the failure. Consider adding the -f flag to curl and checking the exit status to handle errors appropriately.

Apply this diff to add error checking:

-curl -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet" \
+if ! curl -f -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet" \
     -H "Authorization: Bearer $ADMIN_API_KEY" \
     -H "Content-Type: application/json" \
     -d "{
         \"outputDir\": \"$parquet_dir\",
         \"startHeight\": $current_height,
         \"endHeight\": $end_height,
         \"maxFileRows\": $max_rows_per_file
     }"; then
+    echo "Export request failed"
+    exit 1
+fi

Comment on lines +56 to +75
while true; do
if ! status=$(curl -s -f -H "Authorization: Bearer $ADMIN_API_KEY" "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet/status"); then
echo "Failed to get export status"
rm -f "$parquet_dir"/*.parquet
exit 1
fi

export_status=$(echo "$status" | jq -r '.status')
if [ "$export_status" = "completed" ]; then
break
elif [ "$export_status" = "errored" ]; then
error=$(echo "$status" | jq -r '.error')
echo "Export failed: $error"
rm -f "$parquet_dir"/*.parquet
exit 1
fi

echo "Waiting for export to complete..."
sleep 10
done

🛠️ Refactor suggestion

Implement a timeout to avoid infinite wait during export completion

The inner loop waiting for the export to complete lacks a timeout mechanism. If the export process hangs indefinitely, the script could enter an infinite loop. Introduce a maximum retry count to prevent this scenario.

Here's an example modification:

+# Set maximum number of retries (e.g., 30 retries with 10-second intervals = 5 minutes)
+max_retries=30
+retry_count=0

 # Wait for the export to complete
 while true; do
     if ! status=$(curl -s -f -H "Authorization: Bearer $ADMIN_API_KEY" "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet/status"); then
         echo "Failed to get export status"
         rm -f "$parquet_dir"/*.parquet
         exit 1
     fi

     export_status=$(echo "$status" | jq -r '.status')
     if [ "$export_status" = "completed" ]; then
         break
     elif [ "$export_status" = "errored" ]; then
         error=$(echo "$status" | jq -r '.error')
         echo "Export failed: $error"
         rm -f "$parquet_dir"/*.parquet
         exit 1
     fi

     echo "Waiting for export to complete..."
     sleep 10
+    retry_count=$((retry_count + 1))
+    if [ "$retry_count" -ge "$max_retries" ]; then
+        echo "Export did not complete within expected time"
+        rm -f "$parquet_dir"/*.parquet
+        exit 1
+    fi
 done
