Python source

Source code for the llms_txt Python module, which contains helpers to create and use llms.txt files

Introduction

The llms.txt file spec covers files located at the path llms.txt of a website (or, optionally, in a subpath). llms-sample.txt is a simple example. A file following the spec contains the following sections as markdown, in this order:

  • An H1 with the name of the project or site. This is the only required section.
  • A blockquote with a short summary of the project, containing key information necessary for understanding the rest of the file
  • Zero or more markdown sections (e.g. paragraphs, lists, etc.) of any type, except headings, containing more detailed information about the project and how to interpret the provided files
  • Zero or more markdown sections delimited by H2 headers, containing “file lists” of URLs where further detail is available
    • Each “file list” is a markdown list, containing a required markdown hyperlink [name](url), then optionally a : and notes about the file.

Here’s the start of a sample llms.txt file we’ll use for testing:

samp = Path('llms-sample.txt').read_text()
print(samp[:480])
# FastHTML

> FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore's `FT` "FastTags" into a library for creating server-rendered hypermedia applications.

Remember:

- Use `serve()` for running uvicorn (`if __name__ == "__main__"` is not needed since it's automatic)
- When a title is needed with a response, use `Titled`; note that that already wraps children in `Container`, and already includes both the meta title as well as the H1 element

Reading

We’ll implement parse_llms_file to pull out the sections of llms.txt into a simple data structure.


named_re


def named_re(
    nm, pat
):

Pattern to match pat in a named capture group
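Only the signature of named_re is shown here, so the following is a minimal sketch of how such a helper could be written, assuming it simply wraps pat in Python's `(?P<name>...)` named-group syntax. The body is an assumption, not necessarily the module's code.

```python
import re

def named_re(nm, pat):
    "Pattern to match `pat` in a named capture group called `nm` (assumed body)"
    return f'(?P<{nm}>{pat})'

# The group can then be read back by name from the match object
m = re.search(named_re('title', r'\w+'), '# FastHTML')
print(m.group('title'))
```

Naming the groups lets later code collect every matched field at once with `groupdict()`, rather than tracking positional group indices.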


opt_re


def opt_re(
    s
):

Pattern to optionally match s
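As with named_re, only the signature of opt_re appears here. A plausible sketch, assuming it wraps s in a non-capturing group followed by `?`, might look like this (the body is an assumption):

```python
import re

def opt_re(s):
    "Pattern to optionally match `s`, via a non-capturing group (assumed body)"
    return f'(?:{s})?'

# 'b' may be present once or absent, but not repeated
pat = 'a' + opt_re('b') + 'c'
print(bool(re.fullmatch(pat, 'abc')), bool(re.fullmatch(pat, 'ac')), bool(re.fullmatch(pat, 'abbc')))
```

The non-capturing wrapper keeps the optional part from adding an extra unnamed group, so any named groups inside s stay the only captures.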

We’ll work “outside in” so we can test the innermost matches as we go.

Parse sections

sections = '''First bit.

## S1

-[foo](http://foo)
- [foo2](http://foo2): stuff

## S2

- [foo3](http://foo3)'''
start,*rest = re.split(fr'^##\s*(.*?$)', sections, flags=re.MULTILINE)
start
'First bit.\n\n'
rest
['S1',
 '\n\n-[foo](http://foo)\n- [foo2](http://foo2): stuff\n\n',
 'S2',
 '\n\n- [foo3](http://foo3)']
d = dict(chunked(rest, 2))
d
{'S1': '\n\n-[foo](http://foo)\n- [foo2](http://foo2): stuff\n\n',
 'S2': '\n\n- [foo3](http://foo3)'}
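The split works because a capture group inside the pattern makes re.split keep each matched heading in its result, so the list alternates heading, body, heading, body. The sketch below repeats the idea with only the standard library, pairing items with zip in place of chunked (which we assume is imported from fastcore in this notebook).

```python
import re

sections = '''First bit.

## S1

- [foo](http://foo)

## S2

- [foo3](http://foo3)'''

# The capture group keeps each H2 heading text in the split result
start, *rest = re.split(r'^##\s*(.*?$)', sections, flags=re.MULTILINE)

# rest alternates heading, body, ... so pair the even items with the odd items
d = dict(zip(rest[::2], rest[1::2]))
print(list(d))
```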
links = d['S1']
links.strip()
'-[foo](http://foo)\n- [foo2](http://foo2): stuff'
_parse_links(links)
[{'title': 'foo', 'url': 'http://foo', 'desc': None},
 {'title': 'foo2', 'url': 'http://foo2', 'desc': 'stuff'}]
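The body of _parse_links is not shown on this page. A minimal sketch of a parser that produces the same output for this input follows; the name parse_links and the regular expression are illustrative assumptions, not the module's code.

```python
import re

# "- [title](url)" with an optional ": desc" suffix; whitespace after "-" is optional
LINK_PAT = re.compile(r'-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^\)]+)\)(?::\s*(?P<desc>.*))?')

def parse_links(links):
    "Hypothetical stand-in for `_parse_links`: one dict per markdown link line"
    matches = (LINK_PAT.search(line) for line in links.strip().splitlines())
    # Blank or non-link lines produce no match and are skipped
    return [m.groupdict() for m in matches if m]

res = parse_links('\n\n-[foo](http://foo)\n- [foo2](http://foo2): stuff\n\n')
print(res)
```

Because the description group is optional, links without notes get `desc` set to None rather than being dropped.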
start, sects = _parse_llms(samp)
start
'# FastHTML\n\n> FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.\n\nRemember:\n\n- Use `serve()` for running uvicorn (`if __name__ == "__main__"` is not needed since it\'s automatic)\n- When a title is needed with a response, use `Titled`; note that that already wraps children in `Container`, and already includes both the meta title as well as the H1 element.'
title = named_re('title', r'.+?$')
summ = named_re('summary', '.+?$')
summ_pat = opt_re(fr"^>\s*{summ}$")
info = named_re('info', '.*')
pat = fr'^#\s*{title}\n+{summ_pat}\n+{info}'
search(pat, start, (re.MULTILINE|re.DOTALL))
{'title': 'FastHTML',
 'summary': 'FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.',
 'info': 'Remember:\n\n- Use `serve()` for running uvicorn (`if __name__ == "__main__"` is not needed since it\'s automatic)\n- When a title is needed with a response, use `Titled`; note that that already wraps children in `Container`, and already includes both the meta title as well as the H1 element.'}
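The search helper used above is not defined on this page. Judging by its output, it returns the named groups of the first match as a dict; here is a sketch under that assumption, with the named and optional groups written inline so the block runs on its own. The sample strings are made up for illustration.

```python
import re

def search(pat, txt, flags=0):
    "Dict of named groups from the first match of `pat` in `txt`, or None (assumed behaviour)"
    m = re.search(pat, txt, flags=flags)
    return m.groupdict() if m else None

# Same shape as the pattern above: H1 title, optional blockquote summary, then free-form info
title = r'(?P<title>.+?$)'
summ_pat = r'(?:^>\s*(?P<summary>.+?$)$)?'
info = r'(?P<info>.*)'
pat = fr'^#\s*{title}\n+{summ_pat}\n+{info}'

flags = re.MULTILINE | re.DOTALL
with_summary = search(pat, '# Proj\n\n> Short summary.\n\nSome info.\nMore.', flags)
no_summary = search(pat, '# Proj\n\nSome info.', flags)
print(with_summary)
print(no_summary)
```

re.MULTILINE lets `^` and `$` anchor at each line, while re.DOTALL lets the final `.*` run across newlines so info captures everything that remains.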

parse_llms_file


def parse_llms_file(
    txt
):

Parse llms.txt file contents in txt to an AttrDict

llmsd = parse_llms_file(samp)
llmsd.summary
'FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.'
llmsd.sections.Examples
[{'title': 'Todo list application', 'url': 'https://raw.githubusercontent.com/AnswerDotAI/fasthtml/main/examples/adv_app.py', 'desc': 'Detailed walk-thru of a complete CRUD app in FastHTML showing idiomatic use of FastHTML and HTMX patterns.'}]

XML conversion

For some LLMs such as Claude, XML format is preferred, so we’ll provide a function to create that format.


get_doc_content


def get_doc_content(
    url
):

Fetch content from local file if in nbdev repo.


mk_ctx


def mk_ctx(
    d, optional:bool=True, n_workers:NoneType=None
):

Create a Project with a Section for each H2 part in d, optionally skipping the ‘optional’ section.

ctx = mk_ctx(llmsd)
print(to_xml(ctx, do_escape=False)[:260]+'...')
{'title': 'FastHTML quick start', 'url': 'https://fastht.ml/docs/tutorials/quickstart_for_web_devs.html.md', 'desc': 'A brief overview of FastHTML features'}{'title': 'HTMX reference', 'url': 'https://raw.githubusercontent.com/bigskysoftware/htmx/master/www/content/reference.md', 'desc': 'Brief description of all HTMX attributes, CSS classes, headers, events, extensions, js lib methods, and config options'}

{'title': 'Starlette quick guide', 'url': 'https://gist.githubusercontent.com/jph00/e91192e9bdc1640f5421ce3c904f2efb/raw/61a2774912414029edaf1a55b506f0e283b93c46/starlette-quick.md', 'desc': {}}
{'title': 'Todo list application', 'url': 'https://raw.githubusercontent.com/AnswerDotAI/fasthtml/main/examples/adv_app.py', 'desc': 'Detailed walk-thru of a complete CRUD app in FastHTML showing idiomatic use of FastHTML and HTMX patterns.'}
{'title': 'Starlette full documentation', 'url': 'https://gist.githubusercontent.com/jph00/809e4a4808d4510be0e3dc9565e9cbd3/raw/9b717589ca44cedc8aaf00b2b8cacef922964c0f/starlette-sml.md', 'desc': 'A subset of the Starlette documentation useful for FastHTML development.'}
<project title="FastHTML" summary='FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore&#39;s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.'>Remember:

- Use `serve()` for running uvic...

get_sizes


def get_sizes(
    ctx
):

Get the size of each section of the LLM context

get_sizes(ctx)
{'docs': {'FastHTML quick start': 35321,
  'HTMX reference': 28365,
  'Starlette quick guide': 7936},
 'examples': {'Todo list application': 16032},
 'optional': {'Starlette full documentation': 48331}}
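To illustrate the shape of this result without the real context object, here is a sketch that computes the same kind of nested mapping from plain dicts. The section_sizes name, the input layout, and the use of character counts as the size measure are all assumptions made for illustration.

```python
def section_sizes(sections):
    "Hypothetical stand-in for `get_sizes`: character count of each document, grouped by section"
    return {sec: {title: len(text) for title, text in docs.items()}
            for sec, docs in sections.items()}

# Made-up documents; in practice the text would be the fetched file contents
sections = {'docs': {'Quick start': 'x' * 120, 'Reference': 'y' * 80},
            'optional': {'Full documentation': 'z' * 500}}
sizes = section_sizes(sections)
print(sizes)
```

A breakdown like this makes it easy to see which linked files dominate the context, for example when deciding whether to include the optional section.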
Path('../fasthtml.md').write_text(to_xml(ctx, do_escape=False))
137125

create_ctx


def create_ctx(
    txt, optional:bool=False, n_workers:NoneType=None
):

A Project with a Section for each H2 part in txt, optionally skipping the ‘optional’ section.


llms_txt2ctx


def llms_txt2ctx(
    fname:str, # File name to read
    optional:bool_arg=False, # Include 'optional' section?
    n_workers:int=None, # Number of threads to use for parallel downloading
    save_nbdev_fname:str=None, # save output to nbdev `{docs_path}` instead of emitting to stdout
):

Print a Project with a Section for each H2 part in file read from fname, optionally skipping the ‘optional’ section.

!llms_txt2ctx llms-sample.txt > ../fasthtml.md