Sourcery Starbot ⭐ refactored sidphbot/Auto-Research #3
Due to GitHub API limits, only the first 60 comments can be shown.
st.sidebar.image(Image.open('logo_landscape.png'), use_column_width = 'always')
st.title('Auto-Research')
st.write('#### A no-code utility to generate a detailed well-cited survey with topic clustered sections'
'(draft paper format) and other interesting artifacts from a single research query or a curated set of papers(arxiv ids).')
st.write('##### Data Provider: arXiv Open Archive Initiative OAI')
st.write('##### GitHub: https://github.com/sidphbot/Auto-Research')
download_placeholder = st.container()

with st.sidebar.form(key="survey_keywords_form"):
session_data = sp.pydantic_input(key="keywords_input_model", model=KeywordsModel)
st.write('or')
session_data.update(sp.pydantic_input(key="arxiv_ids_input_model", model=ArxivIDsModel))
submit = st.form_submit_button(label="Submit")
st.sidebar.write('#### execution log:')

run_kwargs = {'surveyor':get_surveyor_instance(_print_fn=st.sidebar.write, _survey_print_fn=st.write),
'download_placeholder':download_placeholder}
if submit:
if session_data['research_keywords'] != '':
run_kwargs.update({'research_keywords':session_data['research_keywords'],
'max_search':session_data['max_search'],
'num_papers':session_data['num_papers']})
elif session_data['arxiv_ids'] != '':
run_kwargs.update({'arxiv_ids':[id.strip() for id in session_data['arxiv_ids'].split(',')]})

run_survey(**run_kwargs)
st.sidebar.image(Image.open('logo_landscape.png'), use_column_width = 'always')
st.title('Auto-Research')
st.write('#### A no-code utility to generate a detailed well-cited survey with topic clustered sections'
'(draft paper format) and other interesting artifacts from a single research query or a curated set of papers(arxiv ids).')
st.write('##### Data Provider: arXiv Open Archive Initiative OAI')
st.write('##### GitHub: https://github.com/sidphbot/Auto-Research')
download_placeholder = st.container()

with st.sidebar.form(key="survey_keywords_form"):
session_data = sp.pydantic_input(key="keywords_input_model", model=KeywordsModel)
st.write('or')
session_data.update(sp.pydantic_input(key="arxiv_ids_input_model", model=ArxivIDsModel))
submit = st.form_submit_button(label="Submit")
st.sidebar.write('#### execution log:')

run_kwargs = {'surveyor':get_surveyor_instance(_print_fn=st.sidebar.write, _survey_print_fn=st.write),
'download_placeholder':download_placeholder}
if submit:
if session_data['research_keywords'] != '':
run_kwargs.update({'research_keywords':session_data['research_keywords'],
'max_search':session_data['max_search'],
'num_papers':session_data['num_papers']})
elif session_data['arxiv_ids'] != '':
run_kwargs['arxiv_ids'] = [
id.strip() for id in session_data['arxiv_ids'].split(',')
]

run_survey(**run_kwargs)

Lines 76-101 refactored with the following changes:
- Add single value to dictionary directly rather than using update() (simplify-dictionary-update)
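
A minimal sketch of the equivalence behind this rule. The dictionary contents and arXiv IDs are made up for illustration and are not the app's real inputs:

```python
# Hypothetical stand-in for the comma-separated text field in the form.
raw_ids = "2101.00001, 2101.00002 ,2101.00003"

# Before: a one-key dict is built only to be merged by update().
kwargs_update = {"download_placeholder": None}
kwargs_update.update({"arxiv_ids": [i.strip() for i in raw_ids.split(",")]})

# After: direct item assignment produces the same mapping
# without the throwaway dictionary.
kwargs_assign = {"download_placeholder": None}
kwargs_assign["arxiv_ids"] = [i.strip() for i in raw_ids.split(",")]
```

Only the single-key branch changes; the three-key `update()` call for the keyword search is left alone in the diff, which seems reasonable since it sets several keys at once.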
s = '{} {}'.format(match.group(2), match.group(3))
s = f'{match.group(2)} {match.group(3)}'

Function _parse_author_affil_split refactored with the following changes:
- Replace call to format with f-string (use-fstring-for-formatting)
else:
parts.append(pt)
last = pt
parts.append(pt)
last = pt

Function _remove_double_commas refactored with the following changes:
- Remove unnecessary else after guard condition (remove-unnecessary-else)
def _collaboration_at_start(names: List[str]) \
-> Tuple[List[str], List[List[str]], int]:
def _collaboration_at_start(names: List[str]) -> Tuple[List[str], List[List[str]], int]:
"""Perform special handling of collaboration at start."""
author_list = []

back_propagate_affiliations_to = 0
while len(names) > 0:
while names:

Function _collaboration_at_start refactored with the following changes:
- Simplify sequence length comparison (simplify-len-comparison)
- Replace multiple comparisons of same variable with in operator (merge-comparisons)
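
The hunk above shows only the `while names:` change, so here is a small sketch of both rules side by side. The helper names and the separator tokens are hypothetical and are not taken from the repository:

```python
from typing import List


def drain_with_len(names: List[str]) -> List[str]:
    """Consume the list using an explicit length comparison."""
    out = []
    while len(names) > 0:
        out.append(names.pop(0))
    return out


def drain_truthy(names: List[str]) -> List[str]:
    """Same loop, relying on an empty list being falsy."""
    out = []
    while names:
        out.append(names.pop(0))
    return out


def is_separator_chained(token: str) -> bool:
    # Before: the same variable compared twice with `or`.
    return token == "," or token == "and"


def is_separator_merged(token: str) -> bool:
    # After: one membership test against a tuple of candidates.
    return token in (",", "and")
```

The truthiness form is safe here because `names` is annotated as a `List[str]`; it would not carry over to objects such as NumPy arrays, whose truth value is ambiguous.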
def _enum_collaboration_at_end(author_line: str)->Dict:
def _enum_collaboration_at_end(author_line: str) -> Dict:

Function _enum_collaboration_at_end refactored with the following changes:
- Use named expression to simplify assignment and conditional (use-named-expression)

This removes the following comments (why?):
# Now expect `1) affil1 ', discard if no match
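
The named-expression change itself is not visible in the hunk above. As a rough illustration, assuming a hypothetical pattern for a leading `1) ` enumeration marker (the real regular expression in the repository may differ), the rewrite looks like this and requires Python 3.8 or later:

```python
import re
from typing import Optional

# Hypothetical pattern, for illustration only: "1) affiliation text".
ENUM_PATTERN = re.compile(r"^(\d+)\)\s*(.+)$")


def affil_before(text: str) -> Optional[str]:
    # Before: assign first, then test the result.
    match = ENUM_PATTERN.match(text)
    if match:
        return match.group(2)
    return None


def affil_after(text: str) -> Optional[str]:
    # After: the walrus operator assigns and tests in one expression.
    if match := ENUM_PATTERN.match(text):
        return match.group(2)
    return None
```

Because the bot rewrites the statement the comment was attached to, the comment is dropped; it may be worth restoring it by hand above the new `if`.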
log.info('Searching "{}"...'.format(globber))
log.info('Found: {} pdfs'.format(len(pdffiles)))
log.info(f'Searching "{globber}"...')
log.info(f'Found: {len(pdffiles)} pdfs')

Function convert_directory refactored with the following changes:
- Replace call to format with f-string [×3] (use-fstring-for-formatting)
log.info('Searching "{}"...'.format(globber))
log.info('Found: {} pdfs'.format(len(pdffiles)))
log.info(f'Searching "{globber}"...')
log.info(f'Found: {len(pdffiles)} pdfs')

Function convert_directory_parallel refactored with the following changes:
- Replace call to format with f-string [×2] (use-fstring-for-formatting)
log.error('File conversion failed for {}: {}'.format(pdffile, e))
log.error(f'File conversion failed for {pdffile}: {e}')

Function convert_safe refactored with the following changes:
- Replace call to format with f-string (use-fstring-for-formatting)
raise RuntimeError('No such path: %s' % path)
raise RuntimeError(f'No such path: {path}')

Function convert refactored with the following changes:
- Replace interpolated string formatting with f-string (replace-interpolation-with-fstring)
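
A short sketch of why the three formatting styles used across this pull request are interchangeable for a plain string argument, plus one case where `%` interpolation behaves differently. The path value is a made-up example:

```python
path = "/tmp/missing"  # hypothetical value for illustration

percent_msg = 'No such path: %s' % path
format_msg = 'No such path: {}'.format(path)
fstring_msg = f'No such path: {path}'

# If `path` were ever a tuple, % would treat it as the argument list
# and fail, while the f-string simply formats the tuple itself.
pair = ("a", "b")
try:
    'No such path: %s' % pair
    percent_ok = True
except TypeError:
    percent_ok = False
fstring_pair = f'No such path: {pair}'
```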
for f in files:
if 'txt' in f:
out.append(os.path.join(root, f))

out.extend(os.path.join(root, f) for f in files if 'txt' in f)

Function all_articles refactored with the following changes:
- Replace a for append loop with list extend (for-append-to-extend)
log.info('Completed {} articles'.format(i))
log.info(f'Completed {i} articles')
try:
refs = extract_references(article)
cites[path_to_id(article)] = refs
except:
log.error("Error in {}".format(article))
log.error(f"Error in {article}")

Function citation_list_inner refactored with the following changes:
- Replace call to format with f-string [×2] (use-fstring-for-formatting)
log.info('Calculating citation network for {} articles'.format(len(articles)))
log.info(f'Calculating citation network for {len(articles)} articles')

Function citation_list_parallel refactored with the following changes:
- Replace call to format with f-string (use-fstring-for-formatting)
log.info('Saving to "{}"'.format(filename))
log.info(f'Saving to "{filename}"')

Function save_to_default_location refactored with the following changes:
- Replace call to format with f-string (use-fstring-for-formatting)
if response.status_code == 503:
secs = int(response.headers.get('Retry-After', 20)) * 1.5
log.info('Requested to wait, waiting {} seconds until retry...'.format(secs))

time.sleep(secs)
return get_list_record_chunk(resumptionToken=resumptionToken)
else:
if response.status_code != 503:
raise Exception(
'Unknown error in HTTP request {}, status code: {}'.format(
response.url, response.status_code
)
f'Unknown error in HTTP request {response.url}, status code: {response.status_code}'
)
secs = int(response.headers.get('Retry-After', 20)) * 1.5
log.info(f'Requested to wait, waiting {secs} seconds until retry...')

time.sleep(secs)
return get_list_record_chunk(resumptionToken=resumptionToken)

Function get_list_record_chunk refactored with the following changes:
- Swap if/else branches (swap-if-else-branches)
- Remove unnecessary else after guard condition (remove-unnecessary-else)
- Replace call to format with f-string [×2] (use-fstring-for-formatting)
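
The guard-clause shape that results from the swap can be isolated into a small pure function. This is only a sketch: `retry_delay` is a hypothetical helper that does not exist in the repository, and it takes the status code and headers directly instead of a `requests` response so that it can run without network access:

```python
def retry_delay(status_code: int, headers: dict, url: str = "<url>") -> float:
    """Return the seconds to wait before retrying a 503 response.

    Any other status is rejected up front (the guard clause), so the
    retry logic below it needs no else branch.
    """
    if status_code != 503:
        raise Exception(
            f'Unknown error in HTTP request {url}, status code: {status_code}'
        )
    # Same arithmetic as the diff: honour Retry-After (default 20 s)
    # and add a 50% safety margin.
    return int(headers.get('Retry-After', 20)) * 1.5
```

Note that the HTTP specification also allows `Retry-After` to carry a date instead of a number of seconds; `int()` would raise `ValueError` in that case, both here and in the original code.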
item = elm.find('arXiv:{}'.format(name), OAI_XML_NAMESPACES)
item = elm.find(f'arXiv:{name}', OAI_XML_NAMESPACES)

Function _record_element_text refactored with the following changes:
- Replace call to format with f-string (use-fstring-for-formatting)
logger.info('Requesting "{}" (costs money!)'.format(filename))
logger.info(f'Requesting "{filename}" (costs money!)')
request = requests.get(url, stream=True)
response_iter = request.iter_content(chunk_size=chunk_size)
logger.info("\t Writing {}".format(outfile))
logger.info(f"\t Writing {outfile}")
with gzip.open(outfile, 'wb') as fout:
for i, chunk in enumerate(response_iter):
for chunk in response_iter:
fout.write(chunk)
md5.update(chunk)
else:
logger.info('Requesting "{}" (free!)'.format(filename))
logger.info("\t Writing {}".format(outfile))
logger.info(f'Requesting "{filename}" (free!)')
logger.info(f"\t Writing {outfile}")

Function download_file refactored with the following changes:
- Replace call to format with f-string [×4] (use-fstring-for-formatting)
- Remove unnecessary calls to enumerate when the index is not used (remove-unused-enumerate)
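
A self-contained sketch of the streaming loop after the `enumerate` removal. It writes to an in-memory buffer and uses a fixed list of byte chunks in place of `request.iter_content(...)`; `write_chunks` is a hypothetical helper name:

```python
import gzip
import hashlib
import io
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], fout) -> str:
    """Write each chunk to fout and return the MD5 of the raw bytes."""
    md5 = hashlib.md5()
    # The index was never used, so plain iteration is enough.
    for chunk in chunks:
        fout.write(chunk)
        md5.update(chunk)
    return md5.hexdigest()


buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as fout:
    digest = write_chunks([b"abc", b"def"], fout)
```

The digest is computed over the uncompressed chunks, which matches the order of operations in the diff (`fout.write(chunk)` followed by `md5.update(chunk)`).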
return os.path.join(DIR_PDFTARS, os.path.basename(filename)) + '.gz'
return f'{os.path.join(DIR_PDFTARS, os.path.basename(filename))}.gz'

Function _tar_to_filename refactored with the following changes:
- Use f-string instead of string concatenation (use-fstring-for-concatenation)
msg = "MD5 '{}' does not match expected '{}' for file '{}'".format(
md5_downloaded, md5_expected, filename
)
msg = f"MD5 '{md5_downloaded}' does not match expected '{md5_expected}' for file '{filename}'"

Function download_check_tarfile refactored with the following changes:
- Replace call to format with f-string (use-fstring-for-formatting)
if dryrun:
logger.info(cmd)
return 0
else:
if not dryrun:
return subprocess.check_call(
shlex.split(cmd), stderr=None if debug else open(os.devnull, 'w')
)
logger.info(cmd)
return 0

Function call refactored with the following changes:
- Swap if/else branches (swap-if-else-branches)
- Remove unnecessary else after guard condition (remove-unnecessary-else)
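
Both versions of `call` open `os.devnull` without ever closing the handle. A possible follow-up, not part of this pull request, is to pass `subprocess.DEVNULL` instead. The sketch below assumes the function signature, which is not shown in the hunk, and uses the standard `logging` module in place of the project's `logger`:

```python
import logging
import shlex
import subprocess

logger = logging.getLogger(__name__)


def call(cmd: str, dryrun: bool = False, debug: bool = False) -> int:
    """Run cmd, or only log it when dryrun is set.

    The parameter names mirror the variables in the diff; the exact
    signature in the repository is an assumption.
    """
    if not dryrun:
        return subprocess.check_call(
            shlex.split(cmd),
            # DEVNULL discards stderr without leaving an open file object.
            stderr=None if debug else subprocess.DEVNULL,
        )
    logger.info(cmd)
    return 0
```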
msg = 'Tarfile from manifest not found {}, skipping...'.format(outname)
msg = f'Tarfile from manifest not found {outname}, skipping...'

Function process_tarfile_inner refactored with the following changes:
- Replace call to format with f-string [×5] (use-fstring-for-formatting)
logger.info('Tar file appears processed, skipping {}...'.format(filename))
logger.info(f'Tar file appears processed, skipping {filename}...')
return

logger.info('Processing tar "{}" ...'.format(filename))
logger.info(f'Processing tar "{filename}" ...')

Function process_tarfile refactored with the following changes:
- Replace call to format with f-string [×2] (use-fstring-for-formatting)
logger.info("Indexing {}...".format(name))
logger.info(f"Indexing {name}...")

tarname = os.path.join(DIR_PDFTARS, os.path.basename(name))+'.gz'
tarname = f'{os.path.join(DIR_PDFTARS, os.path.basename(name))}.gz'

Function generate_tarfile_indices refactored with the following changes:
- Replace call to format with f-string (use-fstring-for-formatting)
- Use f-string instead of string concatenation (use-fstring-for-concatenation)
logger.info("Checking {}...".format(tar))
logger.info(f"Checking {tar}...")

Function check_missing_txt_files refactored with the following changes:
- Replace call to format with f-string (use-fstring-for-formatting)
sort = list(reversed(
sorted([(k, v) for k, v in missing.items()], key=lambda x: len(x[1]))
))
sort = list(reversed(sorted(list(missing.items()), key=lambda x: len(x[1]))))

for tar, names in sort:
logger.info("Running {} ({} to do)...".format(tar, len(names)))
logger.info(f"Running {tar} ({len(names)} to do)...")

Function rerun_missing refactored with the following changes:
- Replace identity comprehension with call to collection constructor (identity-comprehension)
- Replace call to format with f-string (use-fstring-for-formatting)
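
The refactored line still wraps `sorted` in both `list(...)` and `reversed(...)`. A shorter spelling, offered here only as a possible further simplification, is to pass `reverse=True`. The `missing` mapping below is invented sample data:

```python
# Hypothetical sample: tar name -> names still to process.
missing = {
    "a.tar": ["1", "2"],
    "b.tar": ["1", "2", "3"],
    "c.tar": ["1"],
}

# As written in the refactored diff.
as_in_diff = list(reversed(sorted(list(missing.items()), key=lambda x: len(x[1]))))

# sorted() accepts any iterable and can sort in descending order directly.
shorter = sorted(missing.items(), key=lambda x: len(x[1]), reverse=True)
```

The two forms agree whenever all lengths are distinct. For entries of equal length they order ties differently: `reverse=True` keeps tied items in their original order, while `reversed(sorted(...))` flips them.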
return '{}/{}.pdf'.format(ym, n)
return f'{ym}/{n}.pdf'
else:
ym = n.split('/')[1][:4]
return '{}/{}.pdf'.format(ym, n.replace('/', ''))
return f"{ym}/{n.replace('/', '')}.pdf"

Function id_to_tarpdf refactored with the following changes:
- Replace call to format with f-string [×2] (use-fstring-for-formatting)
joblib.dump(papers, dump_dir + 'papers_selected_pdf_route.dmp')
joblib.dump(papers, f'{dump_dir}papers_selected_pdf_route.dmp')

Function Surveyor.fetch_papers refactored with the following changes:
- Use f-string instead of string concatenation (use-fstring-for-concatenation)
file.write(research_sections['conclusion'])
self.survey_print_fn(research_sections['conclusion'])
file.write("")
self.survey_print_fn("")

file.write('REFERENCES')
self.survey_print_fn('REFERENCES')
self.survey_print_fn("=================================================")
file.write("=================================================")
file.write("")
self.survey_print_fn("")
for entry in bibentries:
file.write(entry)
self.survey_print_fn(entry)
with open(filename, 'w+') as file:
if query is None:
query = 'Internal(existing) research'
self.survey_print_fn("#### Generated_survey:")
file.write("----------------------------------------------------------------------")
file.write(f"Title: A survey on {query}")
self.survey_print_fn("")
self.survey_print_fn("----------------------------------------------------------------------")
self.survey_print_fn(f"Title: A survey on {query}")
file.write("Author: Auto-Research (github.com/sidphbot/Auto-Research)")
self.survey_print_fn("Author: Auto-Research (github.com/sidphbot/Auto-Research)")
file.write("Dev: Auto-Research (github.com/sidphbot/Auto-Research)")
self.survey_print_fn("Dev: Auto-Research (github.com/sidphbot/Auto-Research)")
file.write("Disclaimer: This survey is intended to be a research starter. This Survey is Machine-Summarized, "+
"\nhence some sentences might be wrangled or grammatically incorrect. However all sentences are "+
"\nmined with proper citations. As All of the text is practically quoted texted, hence to "+
"\nimprove visibility, all the papers are duly cited in the Bibiliography section. as bibtex "+
"\nentries(only to avoid LaTex overhead). ")
self.survey_print_fn("Disclaimer: This survey is intended to be a research starter. This Survey is Machine-Summarized, "+
"\nhence some sentences might be wrangled or grammatically incorrect. However all sentences are "+
"\nmined with proper citations. As All of the text is practically quoted texted, hence to "+
"\nimprove visibility, all the papers are duly cited in the Bibiliography section. as bibtex "+
"\nentries(only to avoid LaTex overhead). ")
file.write("----------------------------------------------------------------------")
self.survey_print_fn("----------------------------------------------------------------------")
file.write("")
self.survey_print_fn("")
file.write('ABSTRACT')
self.survey_print_fn('ABSTRACT')
self.survey_print_fn("=================================================")
file.write("=================================================")
file.write("")
self.survey_print_fn("")
file.write(research_sections['abstract'])
self.survey_print_fn(research_sections['abstract'])
file.write("")
self.survey_print_fn("")
file.write('INTRODUCTION')
self.survey_print_fn('INTRODUCTION')
self.survey_print_fn("=================================================")
file.write("=================================================")
file.write("")
self.survey_print_fn("")
file.write(research_sections['introduction'])
self.survey_print_fn(research_sections['introduction'])
file.write("")
self.survey_print_fn("")
for k, v in research_sections.items():
if k not in ['abstract', 'introduction', 'conclusion']:
file.write(k.upper())
self.survey_print_fn(k.upper())
self.survey_print_fn("=================================================")
file.write("=================================================")
file.write("")
self.survey_print_fn("")
file.write(v)
self.survey_print_fn(v)
file.write("")
self.survey_print_fn("")
file.write('CONCLUSION')
self.survey_print_fn('CONCLUSION')
self.survey_print_fn("=================================================")
file.write("=================================================")
file.write("")
self.survey_print_fn("")
file.write(research_sections['conclusion'])
self.survey_print_fn(research_sections['conclusion'])
file.write("")
self.survey_print_fn("")
self.survey_print_fn("========================XXX=========================")
file.write("========================XXX=========================")
file.close()

file.write('REFERENCES')
self.survey_print_fn('REFERENCES')
self.survey_print_fn("=================================================")
file.write("=================================================")
file.write("")
self.survey_print_fn("")
for entry in bibentries:
file.write(entry)
self.survey_print_fn(entry)
file.write("")
self.survey_print_fn("")
self.survey_print_fn("========================XXX=========================")
file.write("========================XXX=========================")

Function Surveyor.build_doc refactored with the following changes:
- Use with when opening file to ensure closure (ensure-file-closed)
- Use f-string instead of string concatenation [×2] (use-fstring-for-concatenation)
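
A minimal sketch of what the `with` block guarantees, written against a temporary file instead of the survey's real output path. It also illustrates a detail visible in the hunk: `file.write("")` writes nothing, because `write()` never appends a newline on its own:

```python
import os
import tempfile

# Temporary location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "survey.txt")

with open(path, "w+") as file:
    file.write("ABSTRACT")
    file.write("")    # no-op: an empty string adds no line break
    file.write("\n")  # an explicit newline is needed to end the line

# The file is closed as soon as the block exits, even on an exception,
# so no explicit file.close() call is required.
closed_after_block = file.closed

with open(path) as check:
    content = check.read()
```

If the intent of the `file.write("")` calls in `build_doc` is to emit blank lines, the saved file will differ from what `survey_print_fn` displays; that behaviour predates this refactoring and is unchanged by it.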
res = set([str(sent) for sent in list(res.sents)])
summtext = ''.join([line for line in res])
res = {str(sent) for sent in list(res.sents)}
summtext = ''.join(list(res))

Function Surveyor.build_basic_blocks refactored with the following changes:
- Replace list(), dict() or set() with comprehension (collection-builtin-to-comprehension)
- Replace unneeded comprehension with generator (comprehension-to-generator)
- Replace identity comprehension with call to collection constructor (identity-comprehension)
res = set([str(sent) for sent in list(res.sents)])
summtext = ''.join([line for line in res])
#self.print_fn("abstractive summary type:" + str(type(summary)))
return summtext
res = {str(sent) for sent in list(res.sents)}
return ''.join(list(res))

Function Surveyor.abstractive_summary refactored with the following changes:
- Replace list(), dict() or set() with comprehension (collection-builtin-to-comprehension)
- Replace unneeded comprehension with generator (comprehension-to-generator)
- Inline variable that is immediately returned (inline-immediately-returned-variable)
- Replace identity comprehension with call to collection constructor (identity-comprehension)

This removes the following comments (why?):
#self.print_fn("abstractive summary type:" + str(type(summary)))
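
Both the old and the new code deduplicate sentences through a `set`, whose iteration order for strings is not guaranteed and can change between interpreter runs because of hash randomization. The refactoring preserves that behaviour. If sentence order matters for the summary, one alternative (an assumption about intent, not something this pull request does) is to deduplicate with `dict.fromkeys`, which keeps first-seen order:

```python
# Made-up sentences standing in for str(sent) values from res.sents.
sentences = ["B follows.", "A comes first.", "B follows."]

# Set comprehension, as in the refactored code: duplicates removed,
# but the join order is whatever the set happens to yield.
unordered = {s for s in sentences}

# Order-preserving alternative: dict keys are unique and keep
# insertion order (guaranteed since Python 3.7).
ordered_unique = list(dict.fromkeys(sentences))
summtext = "".join(ordered_unique)
```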
abstext = k + '. ' + v.replace('\n', ' ')
abstext = f'{k}. ' + v.replace('\n', ' ')

Function Surveyor.get_corpus_lines refactored with the following changes:
- Use f-string instead of string concatenation (use-fstring-for-concatenation)
Thanks for starring sourcery-ai/sourcery ✨ 🌟 ✨
Here's your pull request refactoring your most popular Python repo.
If you want Sourcery to refactor all your Python repos and incoming pull requests install our bot.
Review changes via command line
To manually merge these changes, make sure you're on the main branch, then run: