feat: ClickHouse import automation PE-7137 #241
base: develop
Conversation
…E-7137 Adding min/max data item heights to the debug endpoint output, both for general utility and to track what height ranges are available to export as Parquet.
This index will be used to quickly determine what stable data items to delete after Parquet exports and imports complete.
This will be used by the Parquet => ClickHouse export automation to clean up data items that have already been exported.
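As a rough sketch of how automation could consume these new debug fields, the snippet below reads the stable data item height range with curl and jq. The endpoint path and JSON field names here are illustrative assumptions, not confirmed parts of this PR.

#!/usr/bin/env bash
# Hypothetical sketch: read the stable data item height range from the debug output.
# The endpoint path and field names are assumptions; adjust to the gateway's actual debug route.
set -euo pipefail

ar_io_host=${AR_IO_HOST:-localhost}
ar_io_port=${AR_IO_PORT:-4000}

debug_json=$(curl -s -f "http://${ar_io_host}:${ar_io_port}/ar-io/__debug")

# Field names mirror the columns added in src/database/sql/bundles/stats.sql.
min_stable_height=$(echo "$debug_json" | jq -r '.db.dataItems.minStableHeight // -1')
max_stable_height=$(echo "$debug_json" | jq -r '.db.dataItems.maxStableHeight // -1')

echo "Stable data items cover heights ${min_stable_height}..${max_stable_height}"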
…ckHouse import PE-7137: Adds a clickhouse-auto-import script that loops forever, exporting Parquet from the SQLite DB, importing it into ClickHouse, moving it to another directory for potential archival, and pruning old data from SQLite. When using Docker Compose, the script runs automatically in a separate service when the 'clickhouse' profile is in use. The script is currently written in bash. In the future we may integrate it into the core service TypeScript codebase, but having a standalone script initially allows for faster iteration.
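In outline, the loop the script runs looks roughly like the sketch below. This is a condensed illustration only: directory names and the sleep interval are simplified, and the export and prune steps go through the admin HTTP endpoints discussed later in this review.

#!/usr/bin/env bash
# Condensed sketch of the clickhouse-auto-import loop described above (not the actual script).
set -euo pipefail

parquet_dir=${PARQUET_PATH:-data/parquet}
imported_dir="$parquet_dir/imported"

while true; do
  # 1. Ask the core service to export a height range of Parquet files
  #    (admin export-parquet endpoint) and wait for the export to complete.
  # 2. Load the exported files into the ClickHouse staging tables.
  ./scripts/clickhouse-import

  # 3. Move the processed files aside for potential archival.
  mv "$parquet_dir"/*.parquet "$imported_dir/"

  # 4. Prune stable data items that are now safely in ClickHouse
  #    (admin prune-stable-data-items endpoint), then repeat.
  sleep 10
done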
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##           develop     #241      +/-   ##
===========================================
- Coverage    70.17%   70.08%    -0.10%
===========================================
  Files           33       33
  Lines         8211     8233       +22
  Branches       455      460        +5
===========================================
+ Hits          5762     5770        +8
- Misses        2447     2461       +14
  Partials         2        2

☔ View full report in Codecov by Sentry.
📝 Walkthrough
This pull request introduces a new Dockerfile for setting up a ClickHouse auto-import environment, along with modifications to the Docker Compose configuration, import scripts, SQL queries, and database and route code.
Actionable comments posted: 14
🧹 Outside diff range and nitpick comments (9)
src/database/sql/bundles/cleanup.sql (1)
16-17: Consider handling NULL values explicitly

The current WHERE clause might delete rows where indexed_at is NULL. If this is not intended, consider adding an explicit NULL check.

 DELETE FROM stable_data_items
-WHERE indexed_at < @indexed_at_threshold
+WHERE indexed_at IS NOT NULL AND indexed_at < @indexed_at_threshold

Dockerfile.clickhouse-auto-import (2)

11-13: Consider documenting directory structure and setting explicit permissions

While the directory structure is created, its purpose and permissions should be more clearly defined.

Consider this enhancement:

 # Create necessary directories
 WORKDIR /app
-RUN mkdir -p data/parquet/imported
+# Create directories with appropriate permissions for imported data
+RUN mkdir -p data/parquet/imported && \
+    chown -R nobody:nogroup data && \
+    chmod -R 755 data

Also, consider adding a comment explaining the purpose of each directory in the structure.
1-26: Consider implementing health checks and logging strategy

While the basic functionality is in place, the Dockerfile could benefit from additional operational features:
- Add HEALTHCHECK instruction to monitor the import process
- Implement structured logging
- Consider adding a volume for persistent data
- Add container labels for metadata
Example health check implementation:
+# Add monitoring capabilities
+HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
+  CMD curl -f http://localhost:8123/ping || exit 1
+
+# Add metadata
+LABEL maintainer="AR.IO Team" \
+      description="ClickHouse auto-import service" \
+      version="1.0"

scripts/clickhouse-import (2)
Line range hint 6-18: Fix potential security and reliability issues in file handling

Several issues need attention:

- Using ls for filename matching is unsafe
- Missing file existence validation
- SQL injection vulnerability through filenames
- No error handling for failed imports
Apply these improvements:
-for blocks_parquet in "$parquet_path/blocks"*.parquet; do
+# Use find instead of shell globbing for safer file handling
+while IFS= read -r blocks_parquet; do
+  # Validate file exists and is readable
+  if [[ ! -r "$blocks_parquet" ]]; then
+    echo "Error: Cannot read $blocks_parquet" >&2
+    continue
+  fi
+
   height_range=$(basename "$blocks_parquet" | sed 's/blocks-//;s/-rowCount:[0-9]*\.parquet//')
-  txs_parquet=$(ls data/parquet/transactions-"$height_range"-*.parquet)
-  tags_parquet=$(ls data/parquet/tags-"$height_range"-*.parquet)
+  # Use find for safer file matching
+  txs_parquet=$(find "$parquet_path" -type f -name "transactions-${height_range}-*.parquet" -print -quit)
+  tags_parquet=$(find "$parquet_path" -type f -name "tags-${height_range}-*.parquet" -print -quit)
+
+  # Validate all required files exist
+  if [[ ! -r "$txs_parquet" || ! -r "$tags_parquet" ]]; then
+    echo "Error: Missing transaction or tag files for height range $height_range" >&2
+    continue
+  fi
+
+  # Escape filenames for SQL safety
+  blocks_parquet_escaped="${blocks_parquet//\'/\'\'}"
+  txs_parquet_escaped="${txs_parquet//\'/\'\'}"
+  tags_parquet_escaped="${tags_parquet//\'/\'\'}"

   echo "Importing $blocks_parquet..."
-  clickhouse client --host "$clickhouse_host" --query="INSERT INTO staging_blocks FROM INFILE '$blocks_parquet' FORMAT Parquet;"
+  if ! clickhouse client --host "$clickhouse_host" --query="INSERT INTO staging_blocks FROM INFILE '$blocks_parquet_escaped' FORMAT Parquet"; then
+    echo "Error: Failed to import blocks file $blocks_parquet" >&2
+    continue
+  fi
-done
+done < <(find "$parquet_path" -type f -name "blocks*.parquet")
Line range hint 20-56: SQL query looks good, with room for optimization

The SQL query is well-structured and handles the data transformation correctly. Consider adding an ORDER BY clause to potentially improve groupArray performance if the order of tags is important.
Optional optimization:
 END AS tags
 FROM staging_transactions txs
 LEFT JOIN staging_tags tags ON txs.height = tags.height AND txs.id = tags.id
 JOIN staging_blocks blocks ON txs.height = blocks.height
-GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, tags.id
+GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, tags.id
+ORDER BY tags.tag_index

docker-compose.yaml (1)
152-153: Consider adding a volume for logs

To aid in debugging and monitoring, consider adding a volume mapping for logs.

 volumes:
   - ${PARQUET_DATA_PATH:-./data/parquet}:/app/data/parquet
+  - ${CLICKHOUSE_AUTO_IMPORT_LOGS_PATH:-./logs/clickhouse-auto-import}:/app/logs
src/routes/ar-io.ts (1)
317-337: Consider adding metrics and progress tracking

For better operability, consider these improvements:
- Add metrics for monitoring (items deleted, time taken)
- Add progress tracking for long-running operations
- Add dry-run mode for safety
Here's how you might implement these features:
 arIoRouter.post(
   '/ar-io/admin/prune-stable-data-items',
   express.json(),
   async (req, res) => {
     try {
-      const { indexedAtThreshold } = req.body;
+      const { indexedAtThreshold, dryRun = false } = req.body;

       if (!Number.isInteger(indexedAtThreshold) || indexedAtThreshold < 0) {
         res
           .status(400)
           .send('Invalid indexedAtThreshold - must be a positive integer');
         return;
       }

+      const startTime = Date.now();
+      const result = await db.pruneStableDataItems(indexedAtThreshold, {
+        dryRun,
+        onProgress: (progress) => {
+          metrics.pruneProgress.set(progress);
+        },
+      });
-      await db.pruneStableDataItems(indexedAtThreshold);
+
+      const duration = Date.now() - startTime;
+      metrics.pruneOperationDuration.observe(duration);
+      metrics.itemsPruned.inc(result.itemsPruned);
+
       res.json({ message: 'Stable data items pruned successfully' });
     } catch (error: any) {
       res.status(500).send(error?.message);
     }
   },
 );

src/database/standalone-sqlite.ts (1)
2430-2435: Consider adding logging and batching

The implementation is correct but could benefit from:
- Adding logging to track the number of deleted records
- Implementing batching for large deletions to prevent potential performance issues
Consider enhancing the implementation:
 pruneStableDataItems(indexedAtThreshold: number) {
+  const before = this.stmts.bundles.countStableDataItems.get().count;
   this.stmts.bundles.deleteStableDataItemsLessThanIndexedAt.run({
     indexed_at_threshold: indexedAtThreshold
   });
+  const after = this.stmts.bundles.countStableDataItems.get().count;
+  this.log.info('Pruned stable data items', {
+    deletedCount: before - after,
+    threshold: indexedAtThreshold
+  });
 }

scripts/clickhouse-auto-import (1)
36-36: Typo in comment: correct 'inverals' to 'intervals'

In the comment on line 36, "inverals" should be "intervals".
Apply this diff to fix the typo:
-# Align to inverals of 10,000
+# Align to intervals of 10,000
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (10)
- Dockerfile.clickhouse-auto-import (1 hunks)
- docker-compose.yaml (2 hunks)
- migrations/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql (1 hunks)
- migrations/down/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql (1 hunks)
- scripts/clickhouse-auto-import (1 hunks)
- scripts/clickhouse-import (3 hunks)
- src/database/sql/bundles/cleanup.sql (1 hunks)
- src/database/sql/bundles/stats.sql (3 hunks)
- src/database/standalone-sqlite.ts (5 hunks)
- src/routes/ar-io.ts (2 hunks)
✅ Files skipped from review due to trivial changes (2)
- migrations/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql
- migrations/down/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql
🔇 Additional comments (13)
src/database/sql/bundles/cleanup.sql (1)
16-17: Verify index existence for performance

The deletion query filters on the indexed_at column. To ensure optimal performance, verify that an index exists on this column in the stable_data_items table.
✅ Verification successful
Index exists for the indexed_at column
An index stable_data_items_indexed_at_idx has been created on the indexed_at column in the migration file migrations/2024.11.21T20.28.45.bundles.data-items-indexed-at-index.sql. The deletion query will be able to use this index for optimal performance.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if an index exists on the indexed_at column of stable_data_items table
# Search for index creation statements
rg -A 2 "CREATE.*INDEX.*stable_data_items.*indexed_at"
# Search for table creation to verify indices
rg -A 10 "CREATE.*TABLE.*stable_data_items"
Length of output: 3963
Dockerfile.clickhouse-auto-import (1)
15-20: 🛠️ Refactor suggestion
Enhance script handling security
The current script handling could be improved for better security and reliability.
Consider these improvements:
-# Copy the auto-import script
+# Copy and verify import scripts
COPY scripts/clickhouse-auto-import /app/scripts/
COPY scripts/clickhouse-import /app/scripts/
-# Make scripts executable
-RUN chmod +x /app/scripts/clickhouse-auto-import /app/scripts/clickhouse-import
+# Set appropriate ownership and permissions
+RUN chown nobody:nogroup /app/scripts/clickhouse-* && \
+ chmod 500 /app/scripts/clickhouse-* && \
+ # Verify scripts exist and are executable
+ test -x /app/scripts/clickhouse-auto-import && \
+ test -x /app/scripts/clickhouse-import
Let's verify the referenced scripts exist in the repository:
✅ Verification successful
Scripts exist but ownership should be set to a non-root user
The scripts are present in the repository and contain appropriate functionality for ClickHouse data import automation. While the current chmod +x approach works, setting specific ownership and restrictive permissions would enhance security.
Recommended changes:
- Set ownership to nobody:nogroup as these scripts don't require root privileges
- Use more restrictive permissions (500) to prevent modification after container build
- Add existence verification as an additional safety check
-# Copy the auto-import script
+# Copy and verify import scripts
COPY scripts/clickhouse-auto-import /app/scripts/
COPY scripts/clickhouse-import /app/scripts/
-# Make scripts executable
-RUN chmod +x /app/scripts/clickhouse-auto-import /app/scripts/clickhouse-import
+# Set appropriate ownership and permissions
+RUN chown nobody:nogroup /app/scripts/clickhouse-* && \
+ chmod 500 /app/scripts/clickhouse-* && \
+ # Verify scripts exist and are executable
+ test -x /app/scripts/clickhouse-auto-import && \
+ test -x /app/scripts/clickhouse-import
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify the existence and content of the import scripts
fd -t f "clickhouse-(auto-)?import$" -x cat {}
Length of output: 6239
src/database/sql/bundles/stats.sql (2)
17-21: LGTM! Well-structured addition of height metrics
The new height metrics follow consistent patterns:
- Proper separation between new and stable items
- Consistent use of IFNULL(-1) default values
- Clear naming convention
17-21: Verify integration with pruning functionality
The new height metrics appear to support the pruning functionality mentioned in the AI summary. However, let's verify the height thresholds are consistently used across the codebase.
Also applies to: 32-36, 48-52
✅ Verification successful
Let me gather more information to verify the usage of these height metrics in the context of pruning functionality.
Let me check one more aspect to understand how these height metrics are used in the SQL queries and bundle processing.
Based on the comprehensive analysis of the codebase, I can now provide the final response:
Height metrics are correctly integrated with pruning functionality
The height metrics in the stats query are used to:
- Track data chronology through min_new_height, max_new_height, min_stable_height, and max_stable_height
- Support pruning via pruneStableDataItems, which uses indexed_at_threshold for cleanup
- Maintain data consistency through height-based indices and cleanup operations
The implementation aligns with the pruning functionality as evidenced by:
- Height-based indices in schema definitions
- Pruning logic in src/database/sql/bundles/cleanup.sql
- Height tracking in both new and stable data items tables
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for height-based pruning logic in related files
# Look for height-based pruning conditions
rg -A 5 "height.*prune|prune.*height" --type ts
# Check for other uses of these metrics
rg -A 5 "(min|max)_(new|stable)_height" --type ts
Length of output: 835
Script:
#!/bin/bash
# Search for pruning-related code and configurations
rg -A 10 "prune" --type ts --type js
# Search for height-related configurations
rg -A 5 "height.*config|config.*height" --type ts --type js
# Look for bundle-related pruning logic
ast-grep --pattern 'function $FUNC($$$) {
$$$
height
$$$
bundle
$$$
}'
Length of output: 3352
Script:
#!/bin/bash
# Check SQL files for height-related queries
rg -A 10 "height" --type sql
# Look for bundle processing logic
ast-grep --pattern 'async function $FUNC($$$) {
$$$
bundle
$$$
}'
# Check for any configuration files that might define height thresholds
fd -e json -e yaml -e yml -x cat {}
Length of output: 103877
scripts/clickhouse-import (1)
58-60: 🛠️ Refactor suggestion
Combine truncates into a single transactional operation
The separate TRUNCATE commands could lead to an inconsistent state if one fails. Additionally, multiple network round trips impact performance.
Combine the operations:
- clickhouse client --host "$clickhouse_host" --query="TRUNCATE TABLE staging_blocks"
- clickhouse client --host "$clickhouse_host" --query="TRUNCATE TABLE staging_transactions"
- clickhouse client --host "$clickhouse_host" --query="TRUNCATE TABLE staging_tags"
+ cat <<EOF | clickhouse client --host "$clickhouse_host"
+START TRANSACTION;
+TRUNCATE TABLE staging_blocks;
+TRUNCATE TABLE staging_transactions;
+TRUNCATE TABLE staging_tags;
+COMMIT;
+EOF
Likely invalid or redundant comment.
docker-compose.yaml (3)
42-42: LGTM: Parquet volume mapping follows established patterns
The new volume mapping for Parquet data follows the project's conventions and provides a sensible default path.
141-142: LGTM: Network configuration aligns with service requirements
Adding the clickhouse service to ar-io-network enables proper communication with other services, particularly the new auto-import service.
159-159: Review admin API key usage
The service is using the admin API key for authentication. Consider:
- Creating a separate, limited-scope API key for this service
- Documenting the required API permissions
src/routes/ar-io.ts (1)
26-26: LGTM!
The import is correctly placed and properly used in the new endpoint.
src/database/standalone-sqlite.ts (4)
374-377: LGTM: New height fields for data items tracking.
The new fields follow the existing pattern and provide valuable debugging information about data item heights in both stable and new states.
1138-1141: LGTM: Proper implementation of data item height fields.
The implementation correctly retrieves values from dataItemStats with appropriate fallback to -1, consistent with other height fields.
3099-3102: LGTM: Proper worker queue implementation.
The method correctly queues the pruning operation to the bundles worker with appropriate write access and return type.
3288-3291: LGTM: Consistent message handler implementation.
The new case follows the established pattern for message handling and maintains the code organization.
# Environment variables
ENV ADMIN_API_KEY=""
Document and validate environment variables
Setting ADMIN_API_KEY to an empty string could lead to security issues if not properly configured at runtime.
Consider adding validation and documentation:
-# Environment variables
-ENV ADMIN_API_KEY=""
+# Required environment variables
+# ADMIN_API_KEY: API key for authentication (required, non-empty)
+ENV ADMIN_API_KEY=
+
+# Validate environment variables at runtime
+COPY scripts/validate-env.sh /app/scripts/
+RUN chmod 500 /app/scripts/validate-env.sh
+ENTRYPOINT ["/app/scripts/validate-env.sh"]
Would you like me to help create the validate-env.sh script that checks for required environment variables at container startup?
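For reference, a minimal validate-env.sh along those lines might look like the sketch below. The script does not exist in this PR, and the required-variable list is an assumption.

#!/usr/bin/env bash
# Hypothetical entrypoint sketch: fail fast if required environment variables are missing.
set -euo pipefail

# Assumed list of required variables; extend as needed.
required_vars=(ADMIN_API_KEY)

for var in "${required_vars[@]}"; do
  if [ -z "${!var:-}" ]; then
    echo "Error: required environment variable $var is not set" >&2
    exit 1
  fi
done

# Hand off to the container's main command (e.g. the auto-import script).
exec "$@"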
# Run the auto-import script
CMD ["/app/scripts/clickhouse-auto-import"]
🛠️ Refactor suggestion
Switch to non-root user and use ENTRYPOINT
The container currently runs as root, which is a security risk. Additionally, the startup command could be more robust.
Consider these security improvements:
+# Create non-root user for running the container
+RUN useradd -r -s /bin/false clickhouse
+
+# Switch to non-root user
+USER clickhouse
+
# Run the auto-import script
-CMD ["/app/scripts/clickhouse-auto-import"]
+ENTRYPOINT ["/app/scripts/validate-env.sh"]
+CMD ["/app/scripts/clickhouse-auto-import"]
This change:
- Creates a dedicated non-root user
- Uses ENTRYPOINT for environment validation
- Maintains CMD for the primary process
RUN apt-get update && apt-get install -y \
    curl \
    jq \
    && rm -rf /var/lib/apt/lists/* \
    && curl https://clickhouse.com/ | sh \
    && mv clickhouse /usr/local/bin/
Critical: Unsafe installation method and missing version controls
The current installation method has several security concerns:
- Piping curl directly to shell is unsafe as it executes unverified code
- No version pinning for ClickHouse or system packages
- No checksum verification of downloaded content
Consider this safer alternative:
RUN apt-get update && apt-get install -y \
- curl \
- jq \
+ curl=7.74.0-1.3+deb11u11 \
+ jq=1.6-2.1 \
&& rm -rf /var/lib/apt/lists/* \
- && curl https://clickhouse.com/ | sh \
- && mv clickhouse /usr/local/bin/
+ && CLICKHOUSE_VERSION="23.8.2.7" \
+ && CLICKHOUSE_PACKAGE="clickhouse-common-static-${CLICKHOUSE_VERSION}.tgz" \
+ && curl -fsSL "https://packages.clickhouse.com/tgz/${CLICKHOUSE_PACKAGE}" -o clickhouse.tgz \
+ && echo "${EXPECTED_HASH} clickhouse.tgz" | sha256sum -c \
+ && tar xvf clickhouse.tgz \
+ && cp ./clickhouse /usr/local/bin/ \
+ && rm -rf clickhouse.tgz
Also consider:
- Using multi-stage builds to reduce final image size
- Adding HEALTHCHECK instruction to verify ClickHouse installation
  null AS max_stable_indexed_at,
  MIN(height) AS min_new_height,
  null AS min_stable_height,
  MAX(height) AS max_new_height,
  null AS max_stable_height
🛠️ Refactor suggestion
Standardize null vs -1 for unavailable data
There's an inconsistency in how unavailable data is represented between the subqueries:
- new_data_items uses null for stable metrics
- stable_data_items uses -1 for new metrics

This inconsistency could lead to confusion or bugs. Consider standardizing to either null or -1.
SELECT
COUNT(*) AS data_item_count,
SUM(
CASE
WHEN parent_id != root_transaction_id THEN 1
ELSE 0
END
) AS nested_data_item_count,
IFNULL(MAX(indexed_at), -1) AS max_new_indexed_at,
- null AS max_stable_indexed_at,
+ -1 AS max_stable_indexed_at,
MIN(height) AS min_new_height,
- null AS min_stable_height,
+ -1 AS min_stable_height,
MAX(height) AS max_new_height,
- null AS max_stable_height
+ -1 AS max_stable_height
And similarly for the stable_data_items subquery:
-1 AS max_new_indexed_at,
IFNULL(MAX(indexed_at), -1) AS max_stable_indexed_at,
- null AS min_new_height,
+ -1 AS min_new_height,
MIN(height) AS min_stable_height,
- null AS max_new_height,
+ -1 AS max_new_height,
MAX(height) AS max_stable_height
Also applies to: 48-52
clickhouse_host=${CLICKHOUSE_HOST:-localhost}
parquet_path=${PARQUET_PATH:-data/parquet}
🛠️ Refactor suggestion
Add connection validation and error handling
While the host configuration is good, we should validate the connection before proceeding with imports to fail fast and provide clear error messages.
Add this validation after the variable declarations:
clickhouse_host=${CLICKHOUSE_HOST:-localhost}
parquet_path=${PARQUET_PATH:-data/parquet}
+
+# Validate ClickHouse connection
+if ! clickhouse client --host "$clickhouse_host" --query "SELECT 1" >/dev/null 2>&1; then
+ echo "Error: Unable to connect to ClickHouse at $clickhouse_host" >&2
+ exit 1
+fi
done

# Import Parquet files
./scripts/clickhouse-import
Add error handling for the ClickHouse import script execution
The script ./scripts/clickhouse-import is executed without checking for errors. If the import fails, subsequent steps may proceed incorrectly. Incorporate error handling to ensure failures are detected.
Apply this diff to check the exit status:
-./scripts/clickhouse-import
+if ! ./scripts/clickhouse-import; then
+ echo "ClickHouse import failed"
+ exit 1
+fi
./scripts/clickhouse-import

# Move processed files to imported directory
mv "$parquet_dir"/*.parquet "$imported_dir/"
Handle potential failures when moving Parquet files
The mv command may fail if there are no Parquet files to move or due to filesystem issues. Add error handling to manage such failures appropriately.
Apply this diff to add error checking:
-mv "$parquet_dir"/*.parquet "$imported_dir/"
+if ! mv "$parquet_dir"/*.parquet "$imported_dir/"; then
+ echo "Failed to move Parquet files to imported directory"
+ exit 1
+fi
curl -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/prune-stable-data-items" \
  -H "Authorization: Bearer $ADMIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"indexedAtThreshold\": $max_indexed_at}"
Add error handling for pruning stable data items
The curl command used to prune stable data items does not check for HTTP errors. If the request fails, the script will proceed without detecting the failure. Add the -f flag and check the exit status to handle errors.
Apply this diff to add error checking:
-curl -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/prune-stable-data-items" \
+if ! curl -f -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/prune-stable-data-items" \
-H "Authorization: Bearer $ADMIN_API_KEY" \
-H "Content-Type: application/json" \
-d "{\"indexedAtThreshold\": $max_indexed_at}"; then
+ echo "Pruning stable data items failed"
+ exit 1
+fi
curl -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet" \
  -H "Authorization: Bearer $ADMIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"outputDir\": \"$parquet_dir\",
    \"startHeight\": $current_height,
    \"endHeight\": $end_height,
    \"maxFileRows\": $max_rows_per_file
  }"
Add error handling for the curl command when exporting Parquet files

The curl command used to initiate the export process does not check for HTTP errors. If the request fails, the script will proceed without detecting the failure. Consider adding the -f flag to curl and checking the exit status to handle errors appropriately.
Apply this diff to add error checking:
-curl -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet" \
+if ! curl -f -X POST "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet" \
-H "Authorization: Bearer $ADMIN_API_KEY" \
-H "Content-Type: application/json" \
-d "{
\"outputDir\": \"$parquet_dir\",
\"startHeight\": $current_height,
\"endHeight\": $end_height,
\"maxFileRows\": $max_rows_per_file
}"; then
+ echo "Export request failed"
+ exit 1
+fi
while true; do
  if ! status=$(curl -s -f -H "Authorization: Bearer $ADMIN_API_KEY" "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet/status"); then
    echo "Failed to get export status"
    rm -f "$parquet_dir"/*.parquet
    exit 1
  fi

  export_status=$(echo "$status" | jq -r '.status')
  if [ "$export_status" = "completed" ]; then
    break
  elif [ "$export_status" = "errored" ]; then
    error=$(echo "$status" | jq -r '.error')
    echo "Export failed: $error"
    rm -f "$parquet_dir"/*.parquet
    exit 1
  fi

  echo "Waiting for export to complete..."
  sleep 10
done
🛠️ Refactor suggestion
Implement a timeout to avoid infinite wait during export completion
The inner loop waiting for the export to complete lacks a timeout mechanism. If the export process hangs indefinitely, the script could enter an infinite loop. Introduce a maximum retry count to prevent this scenario.
Here's an example modification:
+# Set maximum number of retries (e.g., 30 retries with 10-second intervals = 5 minutes)
+max_retries=30
+retry_count=0
# Wait for the export to complete
while true; do
if ! status=$(curl -s -f -H "Authorization: Bearer $ADMIN_API_KEY" "http://${ar_io_host}:${ar_io_port}/ar-io/admin/export-parquet/status"); then
echo "Failed to get export status"
rm -f "$parquet_dir"/*.parquet
exit 1
fi
export_status=$(echo "$status" | jq -r '.status')
if [ "$export_status" = "completed" ]; then
break
elif [ "$export_status" = "errored" ]; then
error=$(echo "$status" | jq -r '.error')
echo "Export failed: $error"
rm -f "$parquet_dir"/*.parquet
exit 1
fi
echo "Waiting for export to complete..."
sleep 10
+ retry_count=$((retry_count + 1))
+ if [ "$retry_count" -ge "$max_retries" ]; then
+ echo "Export did not complete within expected time"
+ rm -f "$parquet_dir"/*.parquet
+ exit 1
+ fi
done